COVID-19, An Analysis

Joshua Wentling


Introduction

On December 31, 2019, the CDC became aware of a cluster of pneumonia cases in Wuhan, China. These, as we later learned, were the first known cases of COVID-19. By January 30, 2020, the WHO had declared the COVID-19 outbreak a global health emergency. By mid-March, much of the U.S. was in lockdown and many statewide stay-at-home orders had been issued. As of December 2020, the pandemic is still growing, with an estimated 74 million cases worldwide. It has been a rollercoaster of a year, but hopefully with the recent vaccine approvals we will see this pandemic come to an end soon. Until then, people around the world are encouraged to stay at home as much as possible and maintain social distancing measures.

It has been almost a year since the pandemic started. Thanks to the hard work of data scientists, we have data for confirmed cases, deaths, and recoveries for every country around the world for each day since this pandemic started. With that data, we can do a lot of analysis and learn more about how the virus has affected different countries (and different states/provinces within those countries) and possibly even build a model using machine learning based on this data. This tutorial will walk you through the entire data science pipeline using data related to COVID-19, and hopefully by the end of it you will have learned a thing or two.

Table of Contents

This page is broken up into two parts. The first is a basic analysis of COVID-19 at the global scale, covering every country included in the dataset I used. The second part repeats this analysis for COVID-19 within the United States at the state level, then goes into a deeper analysis of the US, including the development of a machine learning model. Much of the code from the first part is repeated in the second, so I will explain the repeated code more briefly to reduce redundancy.

  • Imports
  • Global
    • Initial Data Processing
    • Graph and Analysis
    • Choropleth
  • United States
    • Graph and Analysis
    • Choropleth
    • Machine Learning
  • Conclusion

Imports

We will start with our imports. These are all the necessary libraries that we will be using for this tutorial.

In [1]:
# !pip install folium
# !pip install geopandas
import pandas as pd
import folium
from folium import plugins
from folium.plugins import TimeSliderChoropleth
import branca.colormap as cm
import numpy as np
import math
from matplotlib.ticker import FormatStrFormatter
import matplotlib.ticker as mtick
import scipy
from scipy import stats
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm

Global

Initial Data Processing

Next, we will import our data from a csv and take a look.

In [2]:
# https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset
df = pd.read_csv('Data/Covid/covid_19_data.csv')

pd.set_option('display.max_rows', None)
df.head()
Out[2]:
SNo ObservationDate Province/State Country/Region Last Update Confirmed Deaths Recovered
0 1 01/22/2020 Anhui Mainland China 1/22/2020 17:00 1.0 0.0 0.0
1 2 01/22/2020 Beijing Mainland China 1/22/2020 17:00 14.0 0.0 0.0
2 3 01/22/2020 Chongqing Mainland China 1/22/2020 17:00 6.0 0.0 0.0
3 4 01/22/2020 Fujian Mainland China 1/22/2020 17:00 1.0 0.0 0.0
4 5 01/22/2020 Gansu Mainland China 1/22/2020 17:00 0.0 0.0 0.0
In [3]:
dftemp = df.copy()
dftemp = dftemp.groupby(['Country/Region', 'ObservationDate'], as_index=False).sum()
dftemp = dftemp.sort_values('Confirmed', ascending=False).drop_duplicates(['Country/Region'])
dftemp.sort_values(by=['Country/Region'], ascending=False).astype(str)
Out[3]:
Country/Region ObservationDate SNo Confirmed Deaths Recovered
53007 occupied Palestinian territory 03/10/2020 4593 25.0 0.0 0.0
53006 Zimbabwe 12/06/2020 171890 10718.0 291.0 8880.0
52744 Zambia 12/06/2020 171889 17916.0 364.0 17173.0
52480 Yemen 12/06/2020 171888 2337.0 639.0 1549.0
52087 Western Sahara 07/07/2020 58571 10.0 1.0 8.0
51993 West Bank and Gaza 12/06/2020 171886 98038.0 828.0 72773.0
51737 Vietnam 12/06/2020 171885 1366.0 35.0 1220.0
51418 Venezuela 12/06/2020 171884 104442.0 919.0 99494.0
51150 Vatican City 03/09/2020 4507 1.0 0.0 0.0
51122 Vanuatu 11/12/2020 153655 1.0 0.0 0.0
51119 Uzbekistan 12/06/2020 171882 74053.0 611.0 71315.0
50852 Uruguay 12/06/2020 171881 7303.0 82.0 5062.0
50584 United Arab Emirates 12/06/2020 171880 176429.0 592.0 159132.0
50271 Ukraine 12/06/2020 4649847 834913.0 14054.0 435575.0
49992 Uganda 12/06/2020 171879 22499.0 206.0 9175.0
49731 US 12/06/2020 9987689 14757000.0 282299.0 5624444.0
49411 UK 12/06/2020 2582103 1727751.0 61342.0 3736.0
49100 Turkey 12/06/2020 171878 828295.0 14900.0 431253.0
48829 Tunisia 12/06/2020 171877 104002.0 3561.0 76441.0
48551 Trinidad and Tobago 12/06/2020 171876 6767.0 122.0 5937.0
48283 Togo 12/06/2020 171875 3095.0 65.0 2653.0
48007 Timor-Leste 12/06/2020 171874 31.0 0.0 30.0
47743 The Gambia 03/17/2020 6429 1.0 0.0 0.0
47739 The Bahamas 03/18/2020 6706 1.0 0.0 0.0
47736 Thailand 12/06/2020 171873 4107.0 60.0 3868.0
47351 Tanzania 10/02/2020 123091 509.0 21.0 183.0
47150 Tajikistan 12/06/2020 171871 12428.0 87.0 11833.0
46929 Taiwan 12/06/2020 171870 716.0 7.0 574.0
46609 Syria 12/06/2020 171869 8403.0 447.0 3943.0
46348 Switzerland 12/05/2020 171108 344497.0 5324.0 260600.0
46062 Sweden 12/05/2020 3601190 278912.0 7067.0 0.0
45752 Suriname 12/06/2020 171867 5324.0 117.0 5210.0
45484 Sudan 12/06/2020 171866 19468.0 1295.0 11021.0
45215 St. Martin 03/09/2020 4412 2.0 0.0 0.0
45214 Sri Lanka 12/06/2020 171865 27877.0 140.0 20460.0
44897 Spain 12/04/2020 3410888 1684647.0 46252.0 150376.0
44589 South Sudan 12/06/2020 171864 3181.0 62.0 2977.0
44343 South Korea 12/06/2020 171863 38161.0 549.0 29301.0
44023 South Africa 12/06/2020 171862 814565.0 22206.0 744780.0
43743 Somalia 12/03/2020 169581 4525.0 121.0 3480.0
43480 Solomon Islands 12/06/2020 171860 17.0 0.0 5.0
43424 Slovenia 12/06/2020 171859 85805.0 1744.0 63580.0
43147 Slovakia 12/06/2020 171858 116731.0 996.0 82666.0
42871 Singapore 12/06/2020 171857 58260.0 29.0 58160.0
42552 Sierra Leone 12/06/2020 171856 2426.0 74.0 1846.0
42301 Seychelles 12/06/2020 171855 184.0 0.0 168.0
42033 Serbia 12/06/2020 171854 219652.0 1949.0 0.0
41757 Senegal 12/06/2020 171853 16477.0 338.0 15776.0
41477 Saudi Arabia 12/06/2020 171852 358713.0 5965.0 348879.0
41195 Sao Tome and Principe 12/04/2020 170331 999.0 17.0 937.0
40951 San Marino 12/05/2020 171090 1789.0 48.0 1412.0
40665 Saint Vincent and the Grenadines 12/03/2020 169569 87.0 0.0 80.0
40400 Saint Lucia 12/06/2020 171848 265.0 2.0 158.0
40131 Saint Kitts and Nevis 12/05/2020 171087 25.0 0.0 20.0
39869 Saint Barthelemy 03/04/2020 3332 3.0 0.0 0.0
39868 Rwanda 12/06/2020 171846 6129.0 51.0 5696.0
39600 Russia 12/06/2020 14293198 2439163.0 42675.0 1920744.0
39289 Romania 12/06/2020 171845 513576.0 12320.0 405612.0
39004 Reunion 03/21/2020 7488 45.0 0.0 0.0
38988 Republic of the Congo 03/16/2020 6147 1.0 0.0 0.0
38987 Republic of Ireland 03/08/2020 4067 21.0 0.0 0.0
38986 Qatar 12/06/2020 171844 139908.0 239.0 137276.0
38699 Puerto Rico 03/16/2020 6107 3.0 0.0 0.0
38698 Portugal 12/06/2020 171843 322474.0 4963.0 243055.0
38418 Poland 12/06/2020 171842 1063449.0 20089.0 706720.0
38140 Philippines 12/06/2020 171841 439834.0 8554.0 408634.0
37827 Peru 12/06/2020 4475797 972688.0 36231.0 907654.0
37551 Paraguay 12/06/2020 171840 87920.0 1853.0 61948.0
37274 Papua New Guinea 12/03/2020 169559 671.0 7.0 597.0
37015 Panama 12/06/2020 171838 177719.0 3193.0 152890.0
36742 Palestine 03/08/2020 4066 22.0 0.0 0.0
36738 Pakistan 12/06/2020 1204838 420294.0 8398.0 356542.0
36448 Others 03/18/2020 6466 712.0 7.0 325.0
36407 Oman 12/06/2020 171837 124886.0 1444.0 116354.0
36120 Norway 12/06/2020 171836 38323.0 354.0 17998.0
35835 North Macedonia 12/06/2020 171835 67968.0 1924.0 43646.0
35550 North Ireland 02/28/2020 2685 1.0 0.0 0.0
35549 Nigeria 12/06/2020 171834 69255.0 1180.0 64774.0
35266 Niger 12/06/2020 171833 1856.0 77.0 1237.0
34999 Nicaragua 12/01/2020 168032 5838.0 161.0 4225.0
34741 New Zealand 12/06/2020 171831 2079.0 25.0 1998.0
34458 Netherlands 12/06/2020 2927200 566628.0 9768.0 7300.0
34174 Nepal 12/06/2020 171830 240981.0 1594.0 225805.0
33857 Namibia 12/06/2020 171829 15078.0 152.0 13939.0
33589 Mozambique 12/06/2020 171828 16244.0 133.0 14416.0
33329 Morocco 12/06/2020 171827 379657.0 6245.0 331301.0
33049 Montenegro 12/06/2020 171826 38246.0 533.0 26999.0
32784 Mongolia 12/06/2020 171825 887.0 0.0 384.0
32512 Monaco 12/06/2020 171824 638.0 3.0 574.0
32230 Moldova 12/06/2020 171823 116365.0 2419.0 101244.0
31956 Mexico 12/06/2020 5682416 1175850.0 109717.0 866186.0
31672 Mayotte 03/21/2020 7544 7.0 0.0 0.0
31662 Mauritius 12/02/2020 168782 508.0 10.0 465.0
31402 Mauritania 12/06/2020 171821 9516.0 188.0 7849.0
31133 Martinique 03/20/2020 7191 32.0 1.0 0.0
31108 Marshall Islands 11/25/2020 163460 4.0 0.0 1.0
31079 Malta 12/06/2020 171819 10520.0 151.0 8396.0
30804 Mali 12/06/2020 171818 5135.0 164.0 3369.0
30547 Maldives 12/06/2020 171817 13198.0 47.0 12321.0
30273 Malaysia 12/06/2020 171816 72694.0 382.0 61273.0
29955 Malawi 12/06/2020 171815 6051.0 185.0 5476.0
29706 Mainland China 12/06/2020 5337728 86634.0 4634.0 81718.0
29385 Madagascar 12/05/2020 171054 17513.0 255.0 16927.0
28994 Macau 07/29/2020 75066 46.0 0.0 46.0
28595 MS Zaandam 05/11/2020 23628 9.0 2.0 0.0
28549 Luxembourg 12/05/2020 171052 37017.0 345.0 28029.0
28268 Lithuania 12/06/2020 171811 74649.0 626.0 30390.0
27985 Liechtenstein 12/06/2020 171810 1396.0 17.0 1183.0
27707 Libya 12/06/2020 171809 86580.0 1231.0 56702.0
27448 Liberia 12/05/2020 171048 1676.0 83.0 1358.0
27183 Lesotho 12/06/2020 171807 2150.0 44.0 1278.0
26975 Lebanon 12/06/2020 171806 137112.0 1099.0 90229.0
26685 Latvia 12/06/2020 171805 21313.0 262.0 1866.0
26400 Laos 12/01/2020 168004 39.0 0.0 26.0
26147 Kyrgyzstan 12/06/2020 171803 75395.0 1297.0 67400.0
25883 Kuwait 12/06/2020 171802 144369.0 891.0 140035.0
25596 Kosovo 12/06/2020 171801 42805.0 1106.0 28504.0
25333 Kenya 12/06/2020 171800 88380.0 1526.0 68929.0
25064 Kazakhstan 12/06/2020 171799 180089.0 2488.0 151755.0
24795 Jordan 12/06/2020 171798 240089.0 3056.0 187444.0
24511 Jersey 03/16/2020 6118 2.0 0.0 0.0
24508 Japan 12/06/2020 8438259 162917.0 2259.0 135121.0
24188 Jamaica 12/06/2020 171797 11120.0 265.0 6998.0
23917 Ivory Coast 12/06/2020 171796 21485.0 132.0 21109.0
23645 Italy 12/06/2020 3616285 1728878.0 60078.0 913494.0
23334 Israel 12/06/2020 171795 344906.0 2917.0 329368.0
23044 Ireland 12/06/2020 171794 74246.0 2099.0 23364.0
22762 Iraq 12/06/2020 171793 564200.0 12432.0 493567.0
22474 Iran 12/06/2020 171792 1040547.0 50310.0 730798.0
22182 Indonesia 12/06/2020 171791 575796.0 17740.0 474771.0
21902 India 12/06/2020 6370326 9677203.0 140573.0 9139901.0
21590 Iceland 12/06/2020 171790 5482.0 27.0 5269.0
21307 Hungary 12/06/2020 171789 250278.0 5868.0 71682.0
21029 Hong Kong 12/06/2020 172083 6897.0 112.0 5567.0
20709 Honduras 12/06/2020 171788 111023.0 2946.0 49456.0
20438 Holy See 12/06/2020 171787 27.0 0.0 15.0
20165 Haiti 12/05/2020 171026 9370.0 233.0 8146.0
19904 Guyana 12/06/2020 171785 5665.0 153.0 4759.0
19634 Guinea-Bissau 12/06/2020 171784 2441.0 44.0 2327.0
19376 Guinea 12/05/2020 171023 13233.0 76.0 12355.0
19103 Guernsey 03/16/2020 6138 1.0 0.0 0.0
19100 Guatemala 12/06/2020 171782 125550.0 4250.0 114315.0
18828 Guam 03/17/2020 6388 3.0 0.0 0.0
18826 Guadeloupe 03/21/2020 7478 53.0 0.0 0.0
18811 Grenada 11/30/2020 167221 41.0 0.0 30.0
18554 Greenland 03/18/2020 6697 1.0 0.0 0.0
18551 Greece 12/06/2020 171780 115471.0 3003.0 23074.0
18262 Gibraltar 03/06/2020 3718 1.0 0.0 0.0
18259 Ghana 12/06/2020 171779 52274.0 325.0 51063.0
17991 Germany 12/06/2020 2927057 1194550.0 18989.0 868285.0
17677 Georgia 12/06/2020 171778 162475.0 1504.0 133511.0
17392 Gambia, The 03/21/2020 7593 1.0 0.0 0.0
17388 Gambia 12/06/2020 171777 3770.0 123.0 3615.0
17126 Gabon 12/04/2020 170256 9254.0 60.0 9106.0
16860 French Guiana 03/21/2020 7518 18.0 0.0 6.0
16846 France 12/06/2020 1893885 2345648.0 55247.0 175220.0
16528 Finland 12/06/2020 171774 27631.0 415.0 18100.0
16214 Fiji 12/05/2020 171013 44.0 2.0 33.0
15952 Faroe Islands 03/10/2020 4672 2.0 0.0 0.0
15945 Ethiopia 12/06/2020 171772 113295.0 1747.0 80831.0
15676 Eswatini 12/06/2020 171771 6501.0 122.0 6090.0
15408 Estonia 12/06/2020 171770 14978.0 131.0 8959.0
15124 Eritrea 12/06/2020 171769 632.0 0.0 508.0
14861 Equatorial Guinea 12/04/2020 170248 5159.0 85.0 5023.0
14596 El Salvador 12/06/2020 171767 40131.0 1159.0 36525.0
14333 Egypt 12/06/2020 171766 118432.0 6771.0 103501.0
14036 Ecuador 12/06/2020 171765 197998.0 13778.0 174188.0
13755 East Timor 03/21/2020 7589 1.0 0.0 0.0
13754 Dominican Republic 12/06/2020 171764 148453.0 2345.0 116873.0
13469 Dominica 12/02/2020 168723 85.0 0.0 63.0
13213 Djibouti 12/06/2020 171762 5701.0 61.0 5593.0
12718 Diamond Princess 04/19/2020 16481 712.0 13.0 644.0
12692 Denmark 12/06/2020 515847 91131.0 885.0 71395.0
12408 Czech Republic 12/06/2020 171759 546833.0 8902.0 478094.0
12127 Cyprus 12/06/2020 171758 12451.0 61.0 2057.0
11853 Curacao 03/14/2020 5605 1.0 0.0 0.0
11852 Cuba 12/06/2020 171757 8782.0 136.0 7963.0
11582 Croatia 12/06/2020 171756 150353.0 2174.0 124439.0
11296 Costa Rica 12/06/2020 171755 143685.0 1773.0 95789.0
11020 Congo (Kinshasa) 12/06/2020 171754 13527.0 344.0 11947.0
10747 Congo (Brazzaville) 12/04/2020 170233 5774.0 94.0 4988.0
10481 Comoros 12/05/2020 170992 616.0 7.0 600.0
10261 Colombia 12/06/2020 5852810 1371103.0 37808.0 1257410.0
9984 Chile 12/06/2020 2926258 560382.0 15628.0 534789.0
9705 Channel Islands 03/10/2020 4686 1.0 0.0 0.0
9704 Chad 12/06/2020 171751 1725.0 102.0 1564.0
9441 Central African Republic 12/06/2020 171750 4922.0 63.0 1924.0
9174 Cayman Islands 03/15/2020 5856 1.0 0.0 0.0
9171 Cape Verde 03/21/2020 7585 1.0 0.0 0.0
9170 Canada 12/06/2020 2755387 418848.0 12688.0 334375.0
8852 Cameroon 12/04/2020 170229 24752.0 443.0 23344.0
8578 Cambodia 12/06/2020 171748 348.0 0.0 306.0
8263 Cabo Verde 12/06/2020 171747 11063.0 109.0 10596.0
8001 Burundi 12/06/2020 171746 694.0 1.0 575.0
7750 Burma 12/06/2020 171745 99155.0 2110.0 77636.0
7495 Burkina Faso 12/06/2020 171744 3212.0 68.0 2711.0
7223 Bulgaria 12/06/2020 171743 161421.0 4797.0 62246.0
6945 Brunei 12/02/2020 168702 151.0 3.0 145.0
6676 Brazil 12/06/2020 4648893 6603540.0 176941.0 5866657.0
6387 Botswana 12/03/2020 169461 11531.0 34.0 8978.0
6138 Bosnia and Herzegovina 12/06/2020 171740 94507.0 2952.0 58625.0
5861 Bolivia 12/06/2020 171739 145560.0 8995.0 124799.0
5590 Bhutan 12/06/2020 171738 430.0 0.0 388.0
5314 Benin 12/06/2020 171737 3055.0 44.0 2907.0
5048 Belize 12/06/2020 171736 7769.0 173.0 3849.0
4789 Belgium 12/06/2020 2066058 591756.0 17320.0 0.0
4482 Belarus 12/06/2020 171735 147157.0 1207.0 124774.0
4199 Barbados 12/06/2020 171734 285.0 7.0 260.0
3934 Bangladesh 12/06/2020 171733 477545.0 6838.0 395960.0
3660 Bahrain 12/06/2020 171732 87930.0 341.0 86030.0
3373 Bahamas, The 03/21/2020 7552 4.0 0.0 0.0
3370 Bahamas 12/06/2020 171731 7570.0 163.0 5995.0
3110 Azerbaijan 12/06/2020 171730 146679.0 1632.0 88497.0
2829 Austria 12/06/2020 171729 303430.0 3840.0 252765.0
2543 Australia 12/06/2020 1378403 27972.0 908.0 25642.0
2225 Aruba 03/19/2020 6951 4.0 0.0 0.0
2218 Armenia 12/06/2020 171728 141937.0 2326.0 116796.0
1937 Argentina 12/06/2020 171727 1463110.0 39770.0 1294692.0
1657 Antigua and Barbuda 12/05/2020 170966 144.0 4.0 133.0
1389 Angola 12/06/2020 171725 15591.0 354.0 8338.0
1127 Andorra 12/06/2020 171724 7050.0 78.0 6238.0
847 Algeria 12/06/2020 171723 88252.0 2516.0 57146.0
561 Albania 12/06/2020 171722 42988.0 905.0 21617.0
288 Afghanistan 12/06/2020 171721 47306.0 1874.0 37685.0
1 ('St. Martin',) 03/10/2020 4675 2.0 0.0 0.0
0 Azerbaijan 02/28/2020 2664 1.0 0.0 0.0

Graph and Analysis

Let's start by making a line graph of cases over time.

We'll make a copy of the original dataframe so that we can make edits for our line graph. The original dataset breaks the cases down by province/state. For this graph, we will just be looking at the global totals. To account for this, we will sum the data across all regions for each observation date.

In [4]:
dfl = df.copy()
# Get the sum of all confirmed cases in every country for each day
dfl = dfl.groupby(['ObservationDate'], as_index=False).sum()
# Remove all unnecessary data
dfl = dfl[['ObservationDate','Confirmed']]
dfl.head()
Out[4]:
ObservationDate Confirmed
0 01/22/2020 555.0
1 01/23/2020 653.0
2 01/24/2020 941.0
3 01/25/2020 1438.0
4 01/26/2020 2118.0
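The groupby aggregation used in the cell above can be checked on a toy frame (illustrative data only, not the real dataset):

```python
import pandas as pd

# Toy frame mimicking the raw layout: one row per province per day
toy = pd.DataFrame({
    'ObservationDate': ['01/22/2020', '01/22/2020', '01/23/2020'],
    'Confirmed': [1.0, 14.0, 20.0],
})
# Summing within each date collapses the provinces into a single daily total
daily = toy[['ObservationDate', 'Confirmed']].groupby(
    'ObservationDate', as_index=False).sum()
print(daily['Confirmed'].tolist())  # [15.0, 20.0]
```

Selecting only the columns we need before grouping avoids summing columns (like SNo) whose totals are meaningless.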
In [5]:
# Convert the observation date strings to a pd datetime
dfl['ObservationDate'] = dfl['ObservationDate'].apply(pd.to_datetime)
# Adjust the scale of case numbers to millions
dfl['Confirmed'] = dfl['Confirmed']/1000000
# Makes each column a different date for plotting
dfl = dfl.set_index('ObservationDate')
In [6]:
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (10,5)

ax = dfl.plot()
ax.set_xlabel('Date')
ax.set_ylabel('Number of Cases')
ax.set_title('Total Confirmed Cases Over Time')

formatter = FormatStrFormatter('%0.0fM')
ax.yaxis.set_major_formatter(formatter) 

Let's choose a few specific countries to see how their confirmed cases compare.

In [7]:
dftemp = df.copy()
dftemp = dftemp.groupby(['Country/Region', 'ObservationDate'], as_index=False).sum()
dftemp = dftemp.sort_values('Confirmed', ascending=False).drop_duplicates(['Country/Region'])
dftemp = dftemp.sort_values(by=['Confirmed'], ascending=False).astype(str).head(7)
dftemp
Out[7]:
Country/Region ObservationDate SNo Confirmed Deaths Recovered
49731 US 12/06/2020 9987689 14757000.0 282299.0 5624444.0
21902 India 12/06/2020 6370326 9677203.0 140573.0 9139901.0
6676 Brazil 12/06/2020 4648893 6603540.0 176941.0 5866657.0
39600 Russia 12/06/2020 14293198 2439163.0 42675.0 1920744.0
16846 France 12/06/2020 1893885 2345648.0 55247.0 175220.0
23645 Italy 12/06/2020 3616285 1728878.0 60078.0 913494.0
49411 UK 12/06/2020 2582103 1727751.0 61342.0 3736.0
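The sort-then-`drop_duplicates` idiom in the cell above keeps, for each country, the row with the largest cumulative count (i.e. its most recent total). A toy frame makes the behavior clear:

```python
import pandas as pd

toy = pd.DataFrame({
    'Country/Region': ['A', 'A', 'B', 'B'],
    'Confirmed': [10.0, 25.0, 3.0, 7.0],  # cumulative counts over time
})
# Sort descending, then drop_duplicates keeps the first (largest) row per country
latest = (toy.sort_values('Confirmed', ascending=False)
             .drop_duplicates(['Country/Region']))
print(sorted(latest['Confirmed'].tolist()))  # [7.0, 25.0]
```

This works because the case counts are cumulative, so each country's maximum is also its latest value.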
In [8]:
dfl = df.copy()

# Select the countries we want to see
country_list = dftemp['Country/Region'].tolist()
dfl = dfl.loc[dfl['Country/Region'].isin(country_list)]


# Repeat steps from last line graph
dfl = dfl.groupby(['Country/Region', 'ObservationDate'], as_index=False).sum()

# Remove all unnecessary data
df_confirmed = dfl.copy()
df_confirmed = df_confirmed[['Country/Region','ObservationDate','Confirmed']]
# Convert the observation date strings to a pd datetime
df_confirmed['ObservationDate'] = df_confirmed['ObservationDate'].apply(pd.to_datetime)
# Adjust the scale of case numbers to millions
df_confirmed['Confirmed'] = df_confirmed['Confirmed']/1000000
# Adjust dataframe for plotting
df_confirmed = df_confirmed.pivot(index='ObservationDate', columns='Country/Region', values='Confirmed')

plt.rcParams["figure.figsize"] = (10,10)

ax = df_confirmed.plot()
ax.set_xlabel('Date')
ax.set_ylabel('Number of Cases')
ax.set_title('Total Confirmed Cases Over Time')

formatter = FormatStrFormatter('%0.0fM')
ax.yaxis.set_major_formatter(formatter)
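The pivot step deserves a closer look: it reshapes the long frame (one row per country per date) into a wide frame with one column per country, which is the shape `DataFrame.plot` expects to draw one line per country. A toy example:

```python
import pandas as pd

long_df = pd.DataFrame({
    'Country/Region': ['A', 'A', 'B', 'B'],
    'ObservationDate': pd.to_datetime(['2020-01-22', '2020-01-23'] * 2),
    'Confirmed': [1.0, 2.0, 5.0, 8.0],
})
# Long format -> wide format: dates become the index, countries become columns
wide = long_df.pivot(index='ObservationDate',
                     columns='Country/Region', values='Confirmed')
print(wide['A'].tolist(), wide['B'].tolist())  # [1.0, 2.0] [5.0, 8.0]
```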

Let's take a look at the same graph but for deaths instead of total cases.

In [9]:
df_deaths = dfl.copy()
df_deaths = df_deaths[['Country/Region','ObservationDate','Deaths']]
# Convert the observation date strings to a pd datetime
df_deaths['ObservationDate'] = df_deaths['ObservationDate'].apply(pd.to_datetime)
# Adjust the scale of case numbers to millions
df_deaths['Deaths'] = df_deaths['Deaths']/1000
# Adjust dataframe for plotting
df_deaths = df_deaths.pivot(index='ObservationDate', columns='Country/Region', values='Deaths')

plt.rcParams["figure.figsize"] = (10,10)

ax = df_deaths.plot()
ax.set_xlabel('Date')
ax.set_ylabel('Number of Deaths')
ax.set_title('Total Deaths Over Time')

formatter = FormatStrFormatter('%0.0fK')
ax.yaxis.set_major_formatter(formatter)

While this graph is interesting, it doesn't tell us a whole lot about how each of these countries has handled the outbreak overall, since a country with a smaller population is far less likely to make it onto this list, even if a large percentage of its population was infected. Let's redo this analysis using cases as a percentage of each country's total population.

In [10]:
# Data from: https://population.un.org/wpp/Download/Standard/CSV/
df_pop = pd.read_csv('Data/WPP2019_TotalPopulationBySex.csv')
# We only want data for 2020
df_pop = df_pop[df_pop['Time'] == 2020]
# We want just one row per country
df_pop = df_pop.groupby(['Location', 'Time'], as_index=False).mean()
# Get rid of unnecessary columns
df_pop = df_pop[['Location','PopTotal']]
# Population is in thousands, let's fix that
df_pop['PopTotal'] = df_pop['PopTotal']*1000

# Rename 'Location' to 'Country/Region' so the merge works properly
df_pop = df_pop.rename(columns={'Location': 'Country/Region'})

Now we have a list of countries and population totals for the year 2020! Next we will want to merge this with our existing dataframe. We will use a left merge on the existing dataframe (since that's the most important one) and then take a look to see if we're missing any population data.

First we must see if there is any missing data. Using the code below, we will display only the countries whose population total is NaN, meaning that it did not merge properly with our population data.

In [11]:
# We will not be keeping this dataframe, hence why I used 'dftemp'
dftemp = df.copy()
# Left merge because we only care about countries that we have covid data for
dftemp = dftemp.merge(df_pop, on='Country/Region', how='left')
dftemp = dftemp[np.isnan(dftemp['PopTotal'])]
dftemp.groupby(['Country/Region'], as_index=False).mean()
Out[11]:
Country/Region SNo Confirmed Deaths Recovered PopTotal
0 Azerbaijan 2664.000000 1.000000 0.000000 0.000000 NaN
1 ('St. Martin',) 4675.000000 2.000000 0.000000 0.000000 NaN
2 Bahamas, The 7253.666667 3.333333 0.000000 0.000000 NaN
3 Bolivia 76018.431734 69601.712177 3722.612546 43647.557196 NaN
4 Brunei 75495.948718 137.062271 2.278388 124.380952 NaN
5 Burma 80369.188235 15417.839216 345.650980 10150.203922 NaN
6 Cape Verde 7585.000000 1.000000 0.000000 0.000000 NaN
7 Congo (Brazzaville) 77094.973783 2761.700375 52.258427 1728.322097 NaN
8 Congo (Kinshasa) 76033.948339 6831.221402 176.136531 5322.457565 NaN
9 Curacao 5733.000000 1.000000 0.000000 0.000000 NaN
10 Czech Republic 73456.234875 90518.960854 1229.569395 59609.946619 NaN
11 Diamond Princess 79827.649805 712.000000 12.805447 649.626459 NaN
12 East Timor 7589.000000 1.000000 0.000000 0.000000 NaN
13 Gambia, The 7141.750000 1.000000 0.000000 0.000000 NaN
14 Guernsey 6587.625000 0.375000 0.000000 0.000000 NaN
15 Hong Kong 64881.081250 2548.284375 40.859375 2207.271875 NaN
16 Iran 70799.890411 305572.263699 16748.715753 241092.914384 NaN
17 Ivory Coast 75796.937500 11618.933824 75.257353 9925.702206 NaN
18 Jersey 6579.000000 0.750000 0.000000 0.000000 NaN
19 Kosovo 78191.965779 10219.840304 332.965779 7180.281369 NaN
20 Laos 79594.158915 21.120155 0.000000 17.065891 NaN
21 MS Zaandam 80718.622047 8.889764 1.968504 0.110236 NaN
22 Macau 64943.078125 38.390625 0.000000 34.128125 NaN
23 Mainland China 64959.045170 2546.921053 129.388788 2241.340895 NaN
24 Moldova 75313.996350 32699.087591 829.390511 24030.485401 NaN
25 North Ireland 2685.000000 1.000000 0.000000 0.000000 NaN
26 Others 3427.391304 570.173913 4.108696 108.760870 NaN
27 Palestine 3870.800000 17.200000 0.000000 0.000000 NaN
28 Republic of Ireland 4067.000000 21.000000 0.000000 0.000000 NaN
29 Republic of the Congo 6873.833333 0.166667 0.000000 0.000000 NaN
30 Reunion 6116.272727 12.454545 0.000000 0.000000 NaN
31 Russia 101297.522550 14182.890758 234.722563 10417.830793 NaN
32 Saint Barthelemy 3953.571429 2.428571 0.000000 0.000000 NaN
33 South Korea 64732.831250 14912.893750 272.403125 12406.787500 NaN
34 St. Martin 4412.000000 2.000000 0.000000 0.000000 NaN
35 Syria 79107.211538 2273.242308 110.480769 766.634615 NaN
36 Taiwan 64744.350000 394.481250 5.512500 334.659375 NaN
37 Tanzania 77479.751880 438.026316 17.883459 153.703008 NaN
38 The Bahamas 6871.333333 0.500000 0.000000 0.000000 NaN
39 The Gambia 7019.000000 0.200000 0.000000 0.000000 NaN
40 UK 83280.639692 33870.759427 2761.067162 120.507294 NaN
41 US 72297.957352 76987.073582 2314.253304 27325.505106 NaN
42 Vatican City 4124.000000 1.000000 0.000000 0.000000 NaN
43 Venezuela 76953.477612 35674.694030 305.884328 30695.104478 NaN
44 Vietnam 64960.526646 576.670846 12.642633 470.050157 NaN
45 West Bank and Gaza 80231.511719 22473.257812 183.210938 17020.285156 NaN
46 occupied Palestinian territory 5543.285714 3.571429 0.000000 0.000000 NaN
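The merge-then-NaN check above can be reduced to a minimal sketch (toy frames, not the real data): a left merge keeps every row of the COVID data, and any country with no matching population row ends up with NaN in `PopTotal`.

```python
import numpy as np
import pandas as pd

covid = pd.DataFrame({'Country/Region': ['US', 'France'],
                      'Confirmed': [100.0, 50.0]})
pop = pd.DataFrame({'Country/Region': ['France'],
                    'PopTotal': [65000000.0]})
# Left merge: all covid rows survive; unmatched names get NaN populations
merged = covid.merge(pop, on='Country/Region', how='left')
missing = merged[np.isnan(merged['PopTotal'])]
print(missing['Country/Region'].tolist())  # ['US']
```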

Using this list, let's fill in as much of the missing data as we can. Unfortunately, we will not have data for every single country. However, smaller countries with little or no data likely won't have an effect on the analysis we are doing.

In [12]:
df_pop = df_pop.replace({'Country/Region':'Russian Federation'},'Russia')
df_pop = df_pop.replace({'Country/Region':'United States of America'},'US')
df_pop = df_pop.replace({'Country/Region':'Iran (Islamic Republic of)'},'Iran')
df_pop = df_pop.replace({'Country/Region':'Bahamas'},'Bahamas, The')
# Note: 'Azerbaijan' also appears in the covid data with a leading space (' Azerbaijan');
# that stray duplicate has no counterpart in df_pop, so it is left as-is
df_pop = df_pop.replace({'Country/Region':'United Kingdom'},'UK')
df_pop = df_pop.replace({'Country/Region':'Bolivia (Plurinational State of)'},'Bolivia')
df_pop = df_pop.replace({'Country/Region':'Brunei Darussalam'},'Brunei')
df_pop = df_pop.replace({'Country/Region':'Democratic Republic of the Congo'},'Congo (Kinshasa)')
df_pop = df_pop.replace({'Country/Region':'Congo'},'Congo (Brazzaville)')
df_pop = df_pop.replace({'Country/Region':'Curaçao'},'Curacao')
df_pop = df_pop.replace({'Country/Region':'Czechia'},'Czech Republic')
df_pop = df_pop.replace({'Country/Region':'Timor-Leste'},'East Timor')
df_pop = df_pop.replace({'Country/Region':'China, Hong Kong SAR'},'Hong Kong')
df_pop = df_pop.replace({'Country/Region':'China'},'Mainland China')
df_pop = df_pop.replace({'Country/Region':'Republic of Korea'},'South Korea')
df_pop = df_pop.replace({'Country/Region':'China, Taiwan Province of China'},'Taiwan')
df_pop = df_pop.replace({'Country/Region':'United Republic of Tanzania'},'Tanzania')
df_pop = df_pop.replace({'Country/Region':'Venezuela (Bolivarian Republic of)'},'Venezuela')
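As an aside, the chain of replace calls above can be collapsed into a single call with a mapping dict, which does the same renaming in one pass (sketched here on a toy frame):

```python
import pandas as pd

pop = pd.DataFrame({'Country/Region': ['Russian Federation',
                                       'United States of America',
                                       'France']})
# One replace with a dict covers every rename at once; unmapped names pass through
name_map = {'Russian Federation': 'Russia',
            'United States of America': 'US'}
pop['Country/Region'] = pop['Country/Region'].replace(name_map)
print(pop['Country/Region'].tolist())  # ['Russia', 'US', 'France']
```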

Then we can run our merge code again.

In [13]:
df_merge = df.copy()
df_merge = df_merge.merge(df_pop, on='Country/Region', how='left')

Finally, let's add a column that is confirmed cases divided by the population total to get us a confirmed percentage for each country.

In [14]:
df_merge['ConfirmedPercentage'] = df_merge['Confirmed']/df_merge['PopTotal']*100
df_merge.head()
Out[14]:
SNo ObservationDate Province/State Country/Region Last Update Confirmed Deaths Recovered PopTotal ConfirmedPercentage
0 1 01/22/2020 Anhui Mainland China 1/22/2020 17:00 1.0 0.0 0.0 1.439324e+09 6.947707e-08
1 2 01/22/2020 Beijing Mainland China 1/22/2020 17:00 14.0 0.0 0.0 1.439324e+09 9.726790e-07
2 3 01/22/2020 Chongqing Mainland China 1/22/2020 17:00 6.0 0.0 0.0 1.439324e+09 4.168624e-07
3 4 01/22/2020 Fujian Mainland China 1/22/2020 17:00 1.0 0.0 0.0 1.439324e+09 6.947707e-08
4 5 01/22/2020 Gansu Mainland China 1/22/2020 17:00 0.0 0.0 0.0 1.439324e+09 0.000000e+00
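As a sanity check of the formula, take the US's roughly 14.76 million confirmed cases from the table earlier and a population of approximately 331 million (a round figure, not the exact value from df_pop):

```python
# Confirmed cases as a percentage of total population
confirmed = 14_757_000
pop_total = 331_000_000  # approximate US population
confirmed_pct = confirmed / pop_total * 100
print(round(confirmed_pct, 2))  # 4.46
```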

Confirmed Cases as Percentage of Population

Now let's do some analysis using our new ConfirmedPercentage category.

We'll start with another line graph. Let's see if our top 7 countries are the same using ConfirmedPercentage instead of Confirmed.

In [15]:
dftemp = df_merge.copy()
dftemp = dftemp.groupby(['Country/Region', 'ObservationDate'], as_index=False).sum()
dftemp = dftemp.sort_values('ConfirmedPercentage', ascending=False).drop_duplicates(['Country/Region'])
dftemp = dftemp.sort_values(by=['ConfirmedPercentage'], ascending=False).astype(str).head(7)
dftemp
Out[15]:
Country/Region ObservationDate SNo Confirmed Deaths Recovered PopTotal ConfirmedPercentage
1127 Andorra 12/06/2020 171724 7050.0 78.0 6238.0 77265.0 9.1244418559503
33049 Montenegro 12/06/2020 171826 38246.0 533.0 26999.0 628062.0 6.089526193273912
28550 Luxembourg 12/06/2020 171812 37017.0 345.0 28029.0 625975.9999999999 5.913485501041574
40951 San Marino 12/05/2020 171090 1789.0 48.0 1412.0 33937.99999999999 5.271377217278568
3660 Bahrain 12/06/2020 171732 87930.0 341.0 86030.0 1701582.9999999998 5.1675410485412705
12408 Czech Republic 12/06/2020 171759 546833.0 8902.0 478094.0 10708982.000000002 5.106302354416133
4789 Belgium 12/06/2020 2066058 591756.0 17320.0 0.0 139075392.00000003 5.1059155022910145
In [16]:
import matplotlib.ticker as mtick

dfl = df_merge.copy()

# Select the countries we want to see
country_list = dftemp['Country/Region'].tolist()
dfl = dfl.loc[dfl['Country/Region'].isin(country_list)]

# Repeat steps from last line graph
dfl = dfl.groupby(['Country/Region', 'ObservationDate'], as_index=False).sum()

# Remove all unnecessary data
df_confirmed = dfl.copy()
df_confirmed = df_confirmed[['Country/Region','ObservationDate','ConfirmedPercentage']]
# Convert the observation date strings to a pd datetime
df_confirmed['ObservationDate'] = df_confirmed['ObservationDate'].apply(pd.to_datetime)
# Adjust the scale of case numbers to millions
# Adjust dataframe for plotting
df_confirmed = df_confirmed.pivot(index='ObservationDate', columns='Country/Region', values='ConfirmedPercentage')

plt.rcParams["figure.figsize"] = (10,10)

ax = df_confirmed.plot()
ax.set_xlabel('Date')
ax.set_ylabel('Percentage of Population')
ax.set_title('Confirmed Cases As Percentage of Population Over Time')

formatter = mtick.PercentFormatter()
ax.yaxis.set_major_formatter(formatter)

This change to percentage of population gives us a bit more perspective. While the US is still in the top 10, it shows that there are a number of other countries with worse case rates relative to their total populations. Andorra has a significantly higher percentage than any other country, at around 9%.


Choropleth

Next, we will make a time-based choropleth to illustrate the spread of the virus over time. There are a few different ways we can display our data using a time slider choropleth. To start, we will use it to show the overall spread of the virus from its beginning in January 2020 until early December 2020.

I used the following website as a reference to better understand TimeSliderChoropleth, since some of its parameters, such as styledict, have very little documentation: https://www.jumpingrivers.com/blog/interactive-maps-python-covid-19-spread/

We will start by simply making a copy of the dataframe so we can make modifications specific to the time slider choropleth.

In [17]:
# dfc: Data frame choropleth
dfc = df.copy()

Some countries have different labels than our geography database (which we will get to below). We must fix these labels so that they match.

In [18]:
dfc = dfc.replace({'Country/Region':'US'},'United States of America')
dfc = dfc.replace({'Country/Region':'Mainland China'},'China')
dfc = dfc.replace({'Country/Region':'UK'},'United Kingdom')
dfc = dfc.replace({'Country/Region':'Republic of the Congo'},'Congo (Brazzaville)')

Some countries are broken down by province/state. This means that the data is dispersed throughout these regions. Our geographic data only accounts for countries, so we will have to sum the data for the country as a whole.

In [19]:
dfc = dfc.groupby(['Country/Region', 'ObservationDate'], as_index=False).sum()
In [20]:
# Convert time to proper format for TimeSliderChoropleth
dfc['ObservationDate'] = pd.to_datetime(dfc['ObservationDate']).astype(int) / 10**9

# Take the log of the data since some countries have significantly higher case counts than others
dfc['Confirmed Adjusted'] = dfc['Confirmed'].apply(lambda x: math.log10(1+x)) # Add 1 to account for zeros
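Both transformations above can be sketched in isolation (toy values). Note that `.astype('int64')` is an explicit, platform-independent alternative to the `.astype(int)` used above, which can map to a 32-bit integer on some platforms:

```python
import math
import pandas as pd

# Dates become epoch seconds, the string keys TimeSliderChoropleth expects
dates = pd.to_datetime(pd.Series(['01/22/2020', '01/23/2020']))
epochs = (dates.astype('int64') / 10**9).tolist()
print(epochs)  # [1579651200.0, 1579737600.0]

# log10(1 + x) compresses the huge spread in case counts; the +1 keeps zeros finite
scaled = [math.log10(1 + x) for x in [0, 9, 999_999]]
print(scaled)  # approximately [0.0, 1.0, 6.0]
```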

Next we need to calculate the color data for the TimeSliderChoropleth. The color map evenly distributes a range of hex color codes across a dataframe column based on its values; here it is applied to the (log-scaled) number of confirmed cases. The specific color map used doesn't affect the data, just the colors used in the map.

In [21]:
max_color = max(dfc['Confirmed Adjusted'])
min_color = min(dfc['Confirmed Adjusted'])
cmap = cm.linear.YlOrRd_09.scale(min_color, max_color)
dfc['Color'] = dfc['Confirmed Adjusted'].apply(cmap)

Next, we must get the geography data for all countries. This is done using the built-in geopandas dataset. Found here: https://nbviewer.jupyter.org/github/python-visualization/folium/blob/master/examples/TimeSliderChoropleth.ipynb

In [22]:
import geopandas as gpd
assert "naturalearth_lowres" in gpd.datasets.available
datapath = gpd.datasets.get_path("naturalearth_lowres")
gdf = gpd.read_file(datapath)

Next, we must create a style dictionary to store the color and opacity values for each date for each country. This is stored as a dictionary of dictionaries: each outer entry corresponds to a country, and the inner dictionary contains a color value for each date.

A styledict is defined as: A dictionary where the keys are the geojson feature ids and the values are dicts of {time: style_options_dict}
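For concreteness, a styledict for two features would look like the following; the feature ids and Unix timestamps here are made up for illustration.

```python
# Illustrative styledict shape (hypothetical feature ids and timestamps):
# outer keys are GeoJSON feature ids as strings, inner keys are Unix
# timestamps as strings, values are style options for that date.
styledict = {
    '0': {  # e.g. the first feature in the GeoDataFrame
        '1579651200.0': {'color': '#ffffb2', 'opacity': 0.8},
        '1579737600.0': {'color': '#fd8d3c', 'opacity': 0.8},
    },
    '1': {
        '1579651200.0': {'color': '#ffffb2', 'opacity': 0.8},
    },
}
print(styledict['0']['1579737600.0']['color'])  # #fd8d3c
```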

In [23]:
styledict = {}

for index, row in gdf.iterrows():
    cdata = dfc[dfc['Country/Region'] == row['name']]
    temp_dict = {}
    
    for i, r in cdata.iterrows():
        temp_dict[str(r['ObservationDate'])] = {'color': r['Color'], 'opacity': 0.8}

    styledict[str(index)] = temp_dict

Finally, we need to create the folium map. Note that I was unable to map the data properly for a couple countries. Those countries will stay white even after adjusting the slider.

In [24]:
map_osm = folium.Map(min_zoom=2, max_bounds=True)

g = TimeSliderChoropleth(
    data=gdf.to_json(),
    styledict=styledict,
).add_to(map_osm)

_ = cmap.add_to(map_osm)

map_osm
Out[24]:
(Interactive map output; requires a trusted notebook to render.)

United States

As a US citizen familiar with the political, legal, and cultural differences between regions of the country, I would like to take a look at how COVID-19 has affected different states within the US. The same analysis could easily be done using state or provincial data from another country, but for this tutorial I will use the US. The US is also a great candidate because population density, climate, and culture vary widely from state to state. Let's see if we can find any statistically significant correlation between these factors and COVID-19 case rates. Let's get started!

Graph and Analysis

First we will need to create a dataframe with just the data for each US state.

In [25]:
df_us = df.copy()
df_us = df_us[df_us['Country/Region']=='US']

Next, we will repeat the steps above to create a line graph for the states with the most cases. I won't include explanations since we have already covered all of this above.

In [26]:
dftemp = df_us.copy()
dftemp = dftemp.groupby(['Province/State', 'ObservationDate'], as_index=False).sum()
dftemp = dftemp.sort_values('Confirmed', ascending=False).drop_duplicates(['Province/State'])
dftemp = dftemp.sort_values(by=['Confirmed'], ascending=False).astype(str).head(7)
In [27]:
dfl = df_us.copy()

# Repeat steps from last line graph
dfl = dfl.groupby(['Province/State', 'ObservationDate'], as_index=False).sum()

# Save current dataframe for later
df_tot = dfl.copy()

# Select the states we want to see
state_list = dftemp['Province/State'].tolist()
dfl = dfl.loc[dfl['Province/State'].isin(state_list)]

# Remove all unnecessary data
df_confirmed = dfl.copy()

df_confirmed = df_confirmed[['Province/State','ObservationDate','Confirmed']]
# Convert the observation date strings to a pd datetime
df_confirmed['ObservationDate'] = df_confirmed['ObservationDate'].apply(pd.to_datetime)
# Adjust the scale of case numbers to millions
df_confirmed['Confirmed'] = df_confirmed['Confirmed']/1000000
# Adjust dataframe for plotting
df_confirmed = df_confirmed.pivot(index='ObservationDate', columns='Province/State', values='Confirmed')

plt.rcParams["figure.figsize"] = (10,10)

ax = df_confirmed.plot()
ax.set_xlabel('Date')
ax.set_ylabel('Number of Cases')
ax.set_title('Total Confirmed Cases Over Time')

formatter = FormatStrFormatter('%0.0fM')
ax.yaxis.set_major_formatter(formatter)

Unsurprisingly, some of the US's most populous states have the most cases. While this is interesting to see graphically, I would be more interested in seeing how the different states compare relative to each other. I believe the best way to do this is to look at the proportion of people in each state who tested positive for COVID-19. To do this, we can import and merge total population data for each state and calculate the percentage of people in each state who have been infected.
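The merge-and-normalize step can be illustrated on toy data before we run it for real (the state names and numbers below are made up):

```python
import pandas as pd

# Toy sketch of the merge-and-normalize step: join case counts with
# population on the state name, then compute cases per 100 residents.
cases = pd.DataFrame({'Province/State': ['A', 'B'], 'Confirmed': [50, 200]})
pops = pd.DataFrame({'Province/State': ['A', 'B'], 'PopTotal': [1000, 10000]})

merged = cases.merge(pops, on='Province/State', how='inner')
merged['ConfirmedPercentage'] = merged['Confirmed'] / merged['PopTotal'] * 100
print(merged['ConfirmedPercentage'].tolist())  # [5.0, 2.0]
```

An inner merge is used so that any region without a matching population row is simply dropped rather than producing NaN percentages.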

In [28]:
# https://population.un.org/wpp/Download/Standard/CSV/
df_pop = pd.read_csv('Data/State Populations.csv')

# Rename the columns so the merge works properly
df_pop = df_pop.rename(columns={'State': 'Province/State'})
df_pop = df_pop.rename(columns={'2018 Population': 'PopTotal'})

df_pop.head(10)
Out[28]:
Province/State PopTotal
0 California 39776830
1 Texas 28704330
2 Florida 21312211
3 New York 19862512
4 Pennsylvania 12823989
5 Illinois 12768320
6 Ohio 11694664
7 Georgia 10545138
8 North Carolina 10390149
9 Michigan 9991177

Now that we have the data organized nicely, let's merge this with our existing data and calculate the confirmed percentage.

In [29]:
df_merge = df_us.copy()
df_merge = df_merge.merge(df_pop, on='Province/State', how='inner')

df_merge['ConfirmedPercentage'] = df_merge['Confirmed']/df_merge['PopTotal']*100
df_merge.tail(10)
Out[29]:
SNo ObservationDate Province/State Country/Region Last Update Confirmed Deaths Recovered PopTotal ConfirmedPercentage
13887 165057 11/27/2020 Alabama US 2020-11-28 05:25:50 242874.0 3572.0 0.0 4888949 4.967816
13888 165817 11/28/2020 Alabama US 2020-11-29 05:25:55 244993.0 3572.0 0.0 4888949 5.011159
13889 166577 11/29/2020 Alabama US 2020-11-30 05:26:13 247229.0 3577.0 0.0 4888949 5.056895
13890 167337 11/30/2020 Alabama US 2020-12-01 05:26:18 249524.0 3578.0 0.0 4888949 5.103837
13891 168097 12/01/2020 Alabama US 2020-12-02 05:27:41 252900.0 3638.0 0.0 4888949 5.172891
13892 168857 12/02/2020 Alabama US 2020-12-03 05:28:22 256828.0 3711.0 0.0 4888949 5.253235
13893 169617 12/03/2020 Alabama US 2020-12-04 05:27:27 260359.0 3776.0 0.0 4888949 5.325460
13894 170377 12/04/2020 Alabama US 2020-12-05 05:27:36 264199.0 3831.0 0.0 4888949 5.404004
13895 171137 12/05/2020 Alabama US 2020-12-06 05:26:18 267589.0 3877.0 0.0 4888949 5.473344
13896 171897 12/06/2020 Alabama US 2020-12-07 05:26:14 269877.0 3889.0 0.0 4888949 5.520143

Perfect. Since we only need the total number of confirmed cases for this analysis, we will keep only the most recent row for each state (the confirmed counts are cumulative, so the latest row holds the total).
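The keep-the-latest-row pattern used here can be sketched on toy data (values are made up): sorting descending on the cumulative column and then dropping duplicate states retains the first, i.e. largest and most recent, row per state.

```python
import pandas as pd

# Toy sketch: keep only the latest cumulative row for each state.
df = pd.DataFrame({
    'Province/State': ['A', 'A', 'B', 'B'],
    'ObservationDate': ['12/05/2020', '12/06/2020'] * 2,
    'ConfirmedPercentage': [4.9, 5.0, 2.9, 3.0],
})
latest = (df.sort_values('ConfirmedPercentage', ascending=False)
            .drop_duplicates(['Province/State']))
print(sorted(latest['ConfirmedPercentage'].tolist()))  # [3.0, 5.0]
```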

In [30]:
dftemp = df_merge.copy()
# dftemp = dftemp.groupby(['Province/State', 'ObservationDate'], as_index=False).mean()
dftemp = dftemp.sort_values('ConfirmedPercentage', ascending=False).drop_duplicates(['Province/State'])
dftemp = dftemp.sort_values(by=['ConfirmedPercentage'], ascending=False).astype(str).head(7)

Now, we can run the code to create a graph using our new confirmed percentage column to see which states have the highest percentage of covid cases.

In [31]:
# dfl: dataframe line (graph)
dfl = df_merge.copy()

# Select the states we want to see
country_list = dftemp['Province/State'].tolist()
# dfl = dfl.loc[dfl['Province/State'].isin(country_list)]


# Repeat steps from last line graph
dfl = dfl.groupby(['Province/State', 'ObservationDate'], as_index=False).sum()

# Remove all unnecessary data
df_confirmed = dfl.copy()
df_confirmed = df_confirmed[['Province/State','ObservationDate','ConfirmedPercentage']]

# Convert the observation date strings to a pd datetime
df_confirmed['ObservationDate'] = df_confirmed['ObservationDate'].apply(pd.to_datetime)
# Adjust dataframe for plotting
df_plot = df_confirmed.copy()
df_plot = df_plot.loc[dfl['Province/State'].isin(country_list)]
df_plot = df_plot.pivot(index='ObservationDate', columns='Province/State', values='ConfirmedPercentage')

plt.rcParams["figure.figsize"] = (10,10)

ax = df_plot.plot()
ax.set_xlabel('Date')
ax.set_ylabel('Percentage of Population')
ax.set_title('Confirmed Cases As Percentage of Population Over Time')

formatter = mtick.PercentFormatter()
ax.yaxis.set_major_formatter(formatter)

This is much more interesting! This new graph shows the top 7 states with the highest percentages of covid cases. This is very different from our original graph which showcased the overall number of cases.


Choropleth

These steps are the same as last time with a slight modification for the US, so I won't be including all the explanations.

In [32]:
# dfc: Data frame choropleth
dfc = df.copy()
In [33]:
dfc = dfc.replace({'Country/Region':'US'},'United States of America')
dfc = dfc[dfc['Country/Region'] == 'United States of America']
In [34]:
# Convert time to proper format for TimeSliderChoropleth
dfc['ObservationDate'] = pd.to_datetime(dfc['ObservationDate']).astype(int) / 10**9

# Take the log of the data since some states have far higher case counts than others
dfc['Confirmed Adjusted'] = dfc['Confirmed'].apply(lambda x: math.log10(1+x)) # Add 1 to account for zeros
In [35]:
max_color = max(dfc['Confirmed Adjusted'])
min_color = min(dfc['Confirmed Adjusted'])
cmap = cm.linear.YlOrRd_09.scale(min_color, max_color)
dfc['Color'] = dfc['Confirmed Adjusted'].apply(cmap)
In [36]:
state_path = 'Data/Country_Geo/us-states.json'
state_data = gpd.read_file(state_path)
In [37]:
styledict = {}

for index, row in state_data.iterrows():
    cdata = dfc[dfc['Province/State'] == row['name']]
    temp_dict = {}
    
    for i, r in cdata.iterrows():
        temp_dict[str(r['ObservationDate'])] = {'color': r['Color'], 'opacity': 0.8}

    styledict[str(index)] = temp_dict
In [38]:
map_osm = folium.Map(location=[48, -102], zoom_start=3, min_zoom=2, max_bounds=True)

g = TimeSliderChoropleth(
    data=state_data.to_json(),
    styledict=styledict,
).add_to(map_osm)

_ = cmap.add_to(map_osm)

map_osm
Out[38]:
(Interactive map output; requires a trusted notebook to render.)

Machine Learning

For the last part of this tutorial, let's develop a machine learning model to see if we can use a variety of different factors to accurately predict the number of covid cases in each state. For this tutorial, we will build a multiple linear regression model with statsmodels.
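The regression we will build uses statsmodels, but the underlying fit is ordinary least squares, which can be sketched with numpy alone. The predictor names and coefficients below are made-up toy values, not results from our data.

```python
import numpy as np

# Sketch of a multiple-regression fit on noiseless toy data:
# statsmodels' OLS solves the same least-squares problem as
# np.linalg.lstsq on a design matrix with an intercept column.
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 100, 50)     # e.g. an ideology score (toy)
x2 = rng.uniform(0, 50, 50)      # e.g. average age (toy)
y = 8.0 - 0.04 * x1 - 0.05 * x2  # noiseless toy response

X = np.column_stack([np.ones_like(x1), x1, x2])  # intercept + predictors
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(coef, 2))  # recovers [8.0, -0.04, -0.05]
```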

Finding Good Variables

In order to create an effective model, we must first gather some more data unique to each state that we believe might have an impact on its rates of COVID-19. A number of different datasets could prove useful for this machine learning model. I researched and selected the datasets I believed would be most useful; however, this could certainly be expanded on.

Political Ideology

The COVID-19 pandemic has caused significant and rapid changes in the way we live our lives in order to stay as safe as possible. Along with this have come polarizing opinions about the best way to handle the pandemic. Some favor increasing restrictions on businesses and individuals to protect the more vulnerable members of society and reduce the load on our hospitals. Others argue that these restrictions starve small businesses and lower-income individuals who need to continue working in order to survive. These differences in opinion have politicized pandemic policy and have also influenced individual behavior regarding mask use and social distancing.

Because of this, I believe that political ideology could have an impact on covid rates from state to state. I found a dataset that scores each state based on the average political views of its citizens as well as its politicians. After running an analysis using each of these scores, I found the scoring for individual (citizen) political ideology to be more useful, so I will use that number in this analysis.

In [39]:
# .dta file meant for stata program; luckily pandas can read this
df_partisan = pd.io.stata.read_stata('Data/stateideology_v2018.dta')

# Let's get an average from the past 5 years
years = [2013,2014,2015,2016,2017]
df_partisan = df_partisan[df_partisan['year'].isin(years)]

# For some reason, 2014 and 2015 are missing the state names
# This code fills in that missing data
df_partisan = df_partisan.replace(r'^\s*$', np.nan, regex=True)
df_partisan = df_partisan.fillna(method='ffill')

# Average the values over those 5 years
df_partisan = df_partisan.groupby(['statename'], as_index=False).mean()
df_partisan = df_partisan.sort_values('citi6016', ascending=True)

# Let's also make sure the columns align
df_partisan = df_partisan.rename(columns={'statename': 'Province/State'})
# And let's rename the column to something more meaningful
df_partisan = df_partisan.rename(columns={'citi6016': 'CitizenIdeology'})
df_partisan = df_partisan.rename(columns={'inst6017_nom': 'GovernmentIdeology'})
In [40]:
# df confirmed partisan
df_cp = df_confirmed.copy()
df_cp = df_cp.sort_values('ConfirmedPercentage', ascending=False).drop_duplicates(['Province/State'])
df_cp = df_cp.merge(df_partisan, on='Province/State', how='inner')
df_cp.sort_values(by=['ConfirmedPercentage'], ascending=False).astype(str).head(10)
df_cp.tail(100)
Out[40]:
Province/State ObservationDate ConfirmedPercentage state year CitizenIdeology GovernmentIdeology
0 North Dakota 2020-12-06 10.987397 34.0 2015.0 32.864830 31.388329
1 South Dakota 2020-12-06 9.796307 41.0 2015.0 39.302322 29.607647
2 Iowa 2020-12-06 7.742031 15.0 2015.0 47.608620 33.479546
3 Wisconsin 2020-12-06 7.581012 49.0 2015.0 50.566242 22.858555
4 Nebraska 2020-12-06 7.235729 27.0 2015.0 29.198908 27.135815
5 Utah 2020-12-06 6.818090 44.0 2015.0 25.136684 19.874485
6 Montana 2020-12-06 6.389258 26.0 2015.0 46.466301 43.793865
7 Wyoming 2020-12-06 6.312836 50.0 2015.0 26.684616 21.069750
8 Idaho 2020-12-06 6.300959 12.0 2015.0 24.808130 22.535210
9 Minnesota 2020-12-06 6.234042 23.0 2015.0 52.254734 59.618053
10 Illinois 2020-12-06 6.168180 13.0 2015.0 57.911098 57.168694
11 Tennessee 2020-12-06 5.906233 42.0 2015.0 38.731262 24.088978
12 Kansas 2020-12-06 5.871616 16.0 2015.0 36.105789 22.314838
13 Rhode Island 2020-12-05 5.852529 39.0 2015.0 80.016350 68.289284
14 Indiana 2020-12-06 5.696092 14.0 2015.0 40.880875 24.779926
15 Arkansas 2020-12-06 5.659122 4.0 2015.0 36.244991 34.675446
16 Mississippi 2020-12-06 5.529430 24.0 2015.0 44.560547 31.340748
17 Alabama 2020-12-06 5.520143 1.0 2015.0 36.207596 25.924717
18 Nevada 2020-12-06 5.500480 28.0 2015.0 52.379105 40.495228
19 Oklahoma 2020-12-06 5.493842 36.0 2015.0 23.235449 23.841619
20 Missouri 2020-12-06 5.368742 25.0 2015.0 44.363598 42.468441
21 Louisiana 2020-12-06 5.363001 18.0 2015.0 40.553890 31.531614
22 New Mexico 2020-12-06 5.169923 31.0 2015.0 56.475410 42.315861
23 Arizona 2020-12-06 5.113436 3.0 2015.0 43.446938 19.375856
24 Alaska 2020-12-06 5.017966 2.0 2015.0 52.538380 39.996368
25 Florida 2020-12-06 4.964637 9.0 2015.0 48.224495 26.337852
26 Georgia 2020-12-06 4.754845 10.0 2015.0 43.246777 18.617783
27 Texas 2020-12-06 4.608148 43.0 2015.0 41.203247 20.944988
28 Colorado 2020-12-06 4.584301 6.0 2015.0 49.754890 57.053291
29 South Carolina 2020-12-06 4.560873 40.0 2015.0 41.808807 18.094116
30 Kentucky 2020-12-06 4.486116 17.0 2015.0 43.891407 44.179363
31 Michigan 2020-12-06 4.269527 22.0 2015.0 54.756756 27.480343
32 Delaware 2020-12-06 4.109640 8.0 2015.0 66.104446 60.680515
33 New Jersey 2020-12-06 4.074186 30.0 2015.0 63.421669 53.512871
34 Ohio 2020-12-06 4.061887 35.0 2015.0 48.638023 26.878523
35 North Carolina 2020-12-06 3.801582 33.0 2015.0 46.824627 27.300554
36 Massachusetts 2020-12-06 3.724581 21.0 2015.0 78.250504 66.258080
37 Connecticut 2020-12-04 3.558826 7.0 2015.0 94.820755 65.931198
38 New York 2020-12-06 3.553564 32.0 2015.0 70.520905 66.608749
39 Maryland 2020-12-06 3.536860 20.0 2015.0 60.716412 54.376816
40 California 2020-12-06 3.435852 5.0 2015.0 59.882900 70.195778
41 Pennsylvania 2020-12-06 3.299285 38.0 2015.0 52.347935 44.816399
42 West Virginia 2020-12-06 3.050175 48.0 2015.0 41.004856 49.842640
43 Virginia 2020-12-06 2.991592 46.0 2015.0 46.934807 44.350304
44 Washington 2020-12-06 2.356361 47.0 2015.0 54.966827 61.661667
45 Oregon 2020-12-06 2.012019 37.0 2015.0 58.208584 66.550529
46 New Hampshire 2020-12-06 1.842771 29.0 2015.0 53.932976 47.906269
47 Hawaii 2020-12-06 1.320954 11.0 2015.0 76.090012 69.053322
48 Maine 2020-12-06 0.994945 19.0 2015.0 67.041496 49.225498
49 Vermont 2020-12-06 0.803737 45.0 2015.0 86.200874 67.244621
In [41]:
x = df_cp['ConfirmedPercentage']
y = df_cp['CitizenIdeology']
ax = df_cp.plot(x='ConfirmedPercentage',y='CitizenIdeology',kind='scatter',figsize=(10,10))
# Add labels to each point
df_cp[['ConfirmedPercentage','CitizenIdeology','Province/State']].apply(lambda x: ax.text(*x),axis=1)

# Add a trendline
z = np.polyfit(x, y, 1)
p = np.poly1d(z)
plt.plot(x,p(x),"r--")

plt.xlabel('Confirmed Percentage')
plt.ylabel('<-Conservative    Liberal->')
plt.title('Ideology vs Confirmed Percentage')

# Display r squared value
import scipy
from scipy import stats
slope, intercept, r, p, std_err = scipy.stats.linregress(x, y)
print("R Squared: "+str(r**2))
R Squared: 0.3244881816353702

While this correlation isn't perfect, it is much higher than I was expecting. An r squared value of 0.32 on real-world data affected by many different factors is reasonably significant for our analysis. There certainly are some outliers, but overall the data shows a noticeable trend between political ideology and percentage of confirmed cases. I believe this will be useful for our model.
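For reference, the r squared reported here is just the squared Pearson correlation coefficient; it can also be computed with numpy, shown below on made-up toy data.

```python
import numpy as np

# r squared is the squared Pearson correlation between the two columns.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.9])  # roughly y = 2x, so r**2 near 1

r = np.corrcoef(x, y)[0, 1]
print(r**2)  # close to 1.0 for this near-linear relationship
```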

Next, let's take a look at population density to see if we can find any correlation between that and confirmed percentage of cases.

In [42]:
# Data from: https://www.census.gov/data/tables/2010/dec/density-data-text.html
df_density = pd.read_csv('Data/pop_density.csv')
# df_density = df_density[['STATE_OR_REGION','2010_POPULATION','2010_DENSITY']]
df_density['2010_DENSITY'] = df_density['2010_DENSITY'].str.replace(',','')
df_density['2010_DENSITY'] = pd.to_numeric(df_density['2010_DENSITY'],errors='coerce')
df_density.head(100)
Out[42]:
STATE_OR_REGION 1910_POPULATION 1920_POPULATION 1930_POPULATION 1940_POPULATION 1950_POPULATION 1960_POPULATION 1970_POPULATION 1980_POPULATION 1990_POPULATION ... 1920_RANK 1930_RANK 1940_RANK 1950_RANK 1960_RANK 1970_RANK 1980_RANK 1990_RANK 2000_RANK 2010_RANK
0 United States 92228531 106021568 123202660 132165129 151325798 179323175 203211926 226545805 248709873 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Alabama 2138093 2348174 2646248 2832961 3061743 3266740 3444165 3893888 4040587 ... 25.0 24.0 23.0 24.0 28.0 28.0 28.0 27.0 28.0 29.0
2 Alaska 64356 55036 59278 72524 128643 226167 300382 401851 550043 ... 52.0 52.0 52.0 52.0 52.0 52.0 52.0 52.0 52.0 52.0
3 Arizona 204354 334162 435573 499261 749587 1302161 1770900 2718215 3665228 ... 49.0 47.0 47.0 47.0 43.0 43.0 42.0 39.0 38.0 35.0
4 Arkansas 1574449 1752204 1854482 1949387 1909511 1786272 1923295 2286435 2350725 ... 31.0 32.0 32.0 34.0 36.0 37.0 37.0 37.0 36.0 36.0
5 California 2377549 3426861 5677251 6907387 10586223 15717204 19953134 23667902 29760021 ... 35.0 31.0 30.0 22.0 15.0 15.0 16.0 14.0 14.0 13.0
6 Colorado 799024 939629 1035791 1123296 1325089 1753947 2207259 2889964 3294394 ... 42.0 41.0 42.0 42.0 42.0 41.0 40.0 40.0 39.0 39.0
7 Connecticut 1114756 1380631 1606903 1709242 2007280 2535234 3031709 3107576 3287116 ... 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0
8 Delaware 202322 223003 238380 266505 318085 446292 548104 594338 666168 ... 12.0 12.0 12.0 11.0 11.0 9.0 9.0 9.0 9.0 8.0
9 District of Columbia 331069 437571 486869 663091 802178 763956 756510 638333 606900 ... 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
10 Florida 752619 968470 1468211 1897414 2771305 4951560 6789443 9746324 12937926 ... 38.0 35.0 33.0 29.0 19.0 16.0 13.0 12.0 10.0 10.0
11 Georgia 2609121 2895832 2908506 3123723 3444578 3943116 4589575 5463105 6478216 ... 21.0 26.0 27.0 27.0 26.0 26.0 24.0 23.0 20.0 20.0
12 Hawaii 191909 255912 368336 423330 499794 632772 768561 964691 1108229 ... 28.0 21.0 20.0 19.0 17.0 17.0 17.0 15.0 15.0 15.0
13 Idaho 325594 431866 445032 524873 588637 667191 712567 943935 1006749 ... 46.0 46.0 46.0 46.0 47.0 47.0 45.0 46.0 46.0 46.0
14 Illinois 5638591 6485280 7630654 7897241 8712176 10081158 11113976 11426518 11430602 ... 11.0 11.0 11.0 12.0 12.0 12.0 12.0 13.0 13.0 14.0
15 Indiana 2700876 2930390 3238503 3427796 3934224 4662498 5193669 5490224 5544159 ... 13.0 13.0 13.0 14.0 14.0 14.0 15.0 18.0 18.0 18.0
16 Iowa 2224771 2404021 2470939 2538268 2621073 2757537 2824376 2913808 2776755 ... 26.0 28.0 29.0 30.0 30.0 31.0 34.0 35.0 35.0 38.0
17 Kansas 1690949 1769257 1880999 1801028 1905299 2178611 2246578 2363679 2477574 ... 36.0 38.0 39.0 39.0 39.0 39.0 39.0 41.0 42.0 42.0
18 Kentucky 2289905 2416630 2614589 2845627 2944806 3038156 3218706 3660777 3685296 ... 15.0 16.0 17.0 20.0 23.0 25.0 25.0 25.0 25.0 24.0
19 Louisiana 1656388 1798509 2101593 2363880 2683516 3257022 3641306 4205900 4219973 ... 27.0 27.0 26.0 25.0 24.0 22.0 23.0 24.0 24.0 26.0
20 Maine 742371 768014 797423 847226 913774 969265 992048 1124660 1227928 ... 34.0 36.0 36.0 37.0 38.0 38.0 38.0 38.0 40.0 40.0
21 Maryland 1295346 1449661 1631526 1821244 2343001 3100689 3922399 4216975 4781468 ... 9.0 9.0 9.0 8.0 8.0 7.0 7.0 7.0 7.0 7.0
22 Massachusetts 3366416 3852356 4249614 4316721 4690514 5148578 5689170 5737037 6016425 ... 3.0 4.0 4.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0
23 Michigan 2810173 3668412 4842325 5256106 6371766 7823194 8875083 9262078 9295297 ... 14.0 14.0 14.0 13.0 13.0 13.0 14.0 16.0 17.0 19.0
24 Minnesota 2075708 2387125 2563953 2792300 2982483 3413864 3804971 4075970 4375099 ... 32.0 34.0 34.0 33.0 32.0 33.0 35.0 33.0 33.0 33.0
25 Mississippi 1797114 1790618 2009821 2183796 2178914 2178141 2216912 2520638 2573216 ... 29.0 29.0 28.0 31.0 31.0 34.0 33.0 34.0 34.0 34.0
26 Missouri 3293335 3404055 3629367 3784664 3954653 4319813 4676501 4916686 5117073 ... 22.0 23.0 24.0 28.0 29.0 28.0 29.0 29.0 29.0 30.0
27 Montana 376053 548889 537606 559456 591024 674767 694409 786690 799065 ... 47.0 48.0 49.0 49.0 49.0 49.0 50.0 50.0 50.0 50.0
28 Nebraska 1192214 1296372 1377963 1315834 1325510 1411330 1483493 1569825 1578385 ... 40.0 40.0 40.0 40.0 40.0 42.0 43.0 44.0 44.0 45.0
29 Nevada 81875 77407 91058 110247 160083 285278 488738 800493 1201833 ... 51.0 51.0 51.0 51.0 51.0 50.0 49.0 47.0 45.0 44.0
30 New Hampshire 430572 443083 465293 491524 533242 606921 737681 920610 1109252 ... 22.0 25.0 25.0 26.0 27.0 23.0 22.0 20.0 22.0 23.0
31 New Jersey 2537167 3155900 4041334 4160165 4835329 6066782 7168164 7364823 7730188 ... 4.0 3.0 3.0 4.0 3.0 2.0 2.0 2.0 2.0 2.0
32 New Mexico 327301 360350 423317 531818 681187 951023 1016000 1302894 1515069 ... 48.0 49.0 47.0 48.0 48.0 48.0 46.0 45.0 47.0 47.0
33 New York 9113614 10385227 12588066 13479142 14830192 16782304 18236967 17558072 17990455 ... 7.0 7.0 7.0 7.0 7.0 8.0 8.0 8.0 8.0 9.0
34 North Carolina 2206287 2559123 3170276 3571623 4061929 4556155 5082059 5881766 6628637 ... 20.0 17.0 16.0 17.0 18.0 19.0 19.0 19.0 19.0 17.0
35 North Dakota 577056 646872 680845 641935 619636 632446 617761 652717 638800 ... 41.0 42.0 43.0 43.0 45.0 45.0 47.0 48.0 49.0 49.0
36 Ohio 4767121 5759394 6646697 6907612 7946627 9706397 10652017 10797630 10847115 ... 10.0 10.0 10.0 10.0 10.0 11.0 11.0 11.0 11.0 12.0
37 Oklahoma 1657155 2028283 2396040 2336434 2233351 2328284 2559229 3025290 3145585 ... 33.0 33.0 35.0 36.0 37.0 36.0 36.0 36.0 37.0 37.0
38 Oregon 672765 783389 953786 1089684 1521341 1768687 2091385 2633105 2842321 ... 44.0 42.0 41.0 41.0 40.0 40.0 41.0 42.0 41.0 41.0
39 Pennsylvania 7665111 8720017 9631350 9900180 10498012 11319366 11793909 11863895 11881643 ... 8.0 8.0 8.0 9.0 9.0 10.0 10.0 10.0 12.0 11.0
40 Rhode Island 542610 604397 687497 713346 791896 859488 946725 947154 1003464 ... 2.0 2.0 2.0 3.0 2.0 3.0 4.0 4.0 4.0 4.0
41 South Carolina 1515400 1683724 1738765 1899804 2117027 2382594 2590516 3121820 3486703 ... 19.0 20.0 21.0 21.0 21.0 21.0 21.0 22.0 23.0 22.0
42 South Dakota 583888 636547 692849 642961 652740 680514 665507 690768 696004 ... 43.0 44.0 44.0 44.0 46.0 46.0 48.0 49.0 48.0 48.0
43 Tennessee 2184789 2337885 2616556 2915841 3291718 3567089 3923687 4591120 4877185 ... 18.0 18.0 18.0 18.0 20.0 20.0 20.0 21.0 21.0 21.0
44 Texas 3896542 4663228 5824715 6414824 7711194 9579677 11196730 14229191 16986510 ... 39.0 39.0 38.0 38.0 35.0 35.0 32.0 31.0 30.0 28.0
45 Utah 373351 449396 507847 550310 688862 890627 1059273 1461037 1722850 ... 45.0 45.0 45.0 45.0 44.0 44.0 44.0 43.0 43.0 43.0
46 Vermont 355956 352428 359611 359231 377747 389881 444330 511456 562758 ... 30.0 30.0 31.0 32.0 34.0 32.0 31.0 32.0 32.0 32.0
47 Virginia 2061612 2309187 2421851 2677773 3318680 3966949 4648494 5346818 6187358 ... 17.0 19.0 19.0 16.0 16.0 18.0 18.0 17.0 16.0 16.0
48 Washington 1141990 1356621 1563396 1736191 2378963 2853214 3409169 4132156 4866692 ... 37.0 37.0 37.0 35.0 32.0 30.0 30.0 30.0 27.0 27.0
49 West Virginia 1221119 1463701 1729205 1901974 2005552 1860421 1744237 1949644 1793477 ... 15.0 15.0 15.0 15.0 22.0 27.0 27.0 28.0 31.0 31.0
50 Wisconsin 2333860 2632067 2939006 3137587 3434575 3951777 4417731 4705767 4891769 ... 24.0 22.0 22.0 23.0 25.0 24.0 26.0 26.0 26.0 25.0
51 Wyoming 145965 194402 225565 250742 290529 330066 332416 469557 453588 ... 50.0 50.0 50.0 50.0 50.0 51.0 51.0 51.0 51.0 51.0
52 Puerto Rico 1118012 1299809 1543913 1869255 2210703 2349544 2712033 3196520 3522037 ... 5.0 5.0 5.0 2.0 4.0 4.0 3.0 3.0 3.0 3.0

53 rows × 34 columns

In [43]:
# df confirmed partisan density
df_cpd = df_cp.copy()
df_density = df_density.rename(columns={'STATE_OR_REGION': 'Province/State'})
df_density = df_density[['Province/State','2010_DENSITY']]
df_cpd = df_cpd.merge(df_density, on='Province/State', how='inner')


x = df_cpd['ConfirmedPercentage']
y = df_cpd['2010_DENSITY']
ax = df_cpd.plot(x='ConfirmedPercentage',y='2010_DENSITY',kind='scatter',figsize=(10,10))
# Add labels to each point
df_cpd[['ConfirmedPercentage','2010_DENSITY','Province/State']].apply(lambda x: ax.text(*x),axis=1)

# Add a trendline
z = np.polyfit(x, y, 1)
p = np.poly1d(z)
plt.plot(x,p(x),"r--")

plt.xlabel('Confirmed Percentage')
plt.ylabel('Density (People per square mile)')
plt.title('Density vs Confirmed Percentage')

# Display r squared value
import scipy
from scipy import stats
slope, intercept, r, p, std_err = scipy.stats.linregress(x, y)
print("R Squared: "+str(r**2))
# df_density.head(100)
R Squared: 0.05136597912659773

Unfortunately, this r squared value is significantly lower than the correlation for political ideology, and the outliers seem to be skewing our trendline as well. This is surprising, because you might expect more densely populated states to have higher rates, since there are more people per unit area.

Next, we will import demographic data from census.gov containing population estimates broken down by age, sex, and race for each state. For this next graph, we will look at the average age of citizens in each state to see if we can find a correlation between average age and covid percentage rates.
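The population-weighted average the next cells compute can be sketched with toy numbers: each age is weighted by the estimated number of people of that age, then divided by the total population.

```python
# Toy sketch of a population-weighted average age (made-up numbers).
ages = [20, 40, 60]
counts = [500, 300, 200]  # toy population estimates per age

weighted_sum = sum(a * c for a, c in zip(ages, counts))
average_age = weighted_sum / sum(counts)
print(average_age)  # 34.0
```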

In [44]:
# Data from: https://www.census.gov/data/datasets/time-series/demo/popest/2010s-state-detail.html
# df_asr: df age sex race
df_asr = pd.read_csv('Data/sc-est2019-alldata6.csv')
# Remove columns we don't need
df_asr = df_asr[['NAME','SEX','ORIGIN','RACE','AGE','POPESTIMATE2019']]
In [45]:
df_age = df_asr.copy()
df_age = df_age[df_age.SEX != 0]
df_age = df_age[df_age.ORIGIN != 0]
df_age = df_age[['NAME','AGE','POPESTIMATE2019']]
df_age = df_age.groupby(['NAME', 'AGE'], as_index=False).sum()
df_age['weighted'] = df_age['AGE'] * df_age['POPESTIMATE2019']
df_age = df_age.groupby(['NAME'], as_index=False).sum()
df_age['AverageAge'] = df_age['weighted'] / df_age['POPESTIMATE2019']
df_age = df_age[['NAME','AverageAge']]
df_age.head(10)
Out[45]:
NAME AverageAge
0 Alabama 39.482018
1 Alaska 36.330879
2 Arizona 39.170100
3 Arkansas 39.016298
4 California 38.033199
5 Colorado 38.106171
6 Connecticut 40.588005
7 Delaware 40.828502
8 District of Columbia 36.648714
9 Florida 41.874931
In [46]:
df_cpa = df_cpd.copy()

df_age = df_age.rename(columns={'NAME': 'Province/State'})
df_cpa = df_cpa.merge(df_age, on='Province/State', how='inner')
df_cpa.head(100)
Out[46]:
Province/State ObservationDate ConfirmedPercentage state year CitizenIdeology GovernmentIdeology 2010_DENSITY AverageAge
0 North Dakota 2020-12-06 10.987397 34.0 2015.0 32.864830 31.388329 9.7 37.518283
1 South Dakota 2020-12-06 9.796307 41.0 2015.0 39.302322 29.607647 10.7 38.462802
2 Iowa 2020-12-06 7.742031 15.0 2015.0 47.608620 33.479546 54.5 39.130726
3 Wisconsin 2020-12-06 7.581012 49.0 2015.0 50.566242 22.858555 105.0 39.849338
4 Nebraska 2020-12-06 7.235729 27.0 2015.0 29.198908 27.135815 23.8 37.922156
5 Utah 2020-12-06 6.818090 44.0 2015.0 25.136684 19.874485 33.6 33.714779
6 Montana 2020-12-06 6.389258 26.0 2015.0 46.466301 43.793865 6.8 40.394879
7 Wyoming 2020-12-06 6.312836 50.0 2015.0 26.684616 21.069750 5.8 39.021311
8 Idaho 2020-12-06 6.300959 12.0 2015.0 24.808130 22.535210 19.0 37.841917
9 Minnesota 2020-12-06 6.234042 23.0 2015.0 52.254734 59.618053 66.6 38.857877
10 Illinois 2020-12-06 6.168180 13.0 2015.0 57.911098 57.168694 231.1 38.997553
11 Tennessee 2020-12-06 5.906233 42.0 2015.0 38.731262 24.088978 153.9 39.214731
12 Kansas 2020-12-06 5.871616 16.0 2015.0 36.105789 22.314838 34.9 38.193672
13 Rhode Island 2020-12-05 5.852529 39.0 2015.0 80.016350 68.289284 1018.1 40.451412
14 Indiana 2020-12-06 5.696092 14.0 2015.0 40.880875 24.779926 181.0 38.511627
15 Arkansas 2020-12-06 5.659122 4.0 2015.0 36.244991 34.675446 56.0 39.016298
16 Mississippi 2020-12-06 5.529430 24.0 2015.0 44.560547 31.340748 63.2 38.562118
17 Alabama 2020-12-06 5.520143 1.0 2015.0 36.207596 25.924717 94.4 39.482018
18 Nevada 2020-12-06 5.500480 28.0 2015.0 52.379105 40.495228 24.6 38.793240
19 Oklahoma 2020-12-06 5.493842 36.0 2015.0 23.235449 23.841619 54.7 38.034852
20 Missouri 2020-12-06 5.368742 25.0 2015.0 44.363598 42.468441 87.1 39.415176
21 Louisiana 2020-12-06 5.363001 18.0 2015.0 40.553890 31.531614 104.9 38.371583
22 New Mexico 2020-12-06 5.169923 31.0 2015.0 56.475410 42.315861 17.0 39.314857
23 Arizona 2020-12-06 5.113436 3.0 2015.0 43.446938 19.375856 56.3 39.170100
24 Alaska 2020-12-06 5.017966 2.0 2015.0 52.538380 39.996368 1.2 36.330879
25 Florida 2020-12-06 4.964637 9.0 2015.0 48.224495 26.337852 350.6 41.874931
26 Georgia 2020-12-06 4.754845 10.0 2015.0 43.246777 18.617783 168.4 37.656965
27 Texas 2020-12-06 4.608148 43.0 2015.0 41.203247 20.944988 96.3 36.172459
28 Colorado 2020-12-06 4.584301 6.0 2015.0 49.754890 57.053291 48.5 38.106171
29 South Carolina 2020-12-06 4.560873 40.0 2015.0 41.808807 18.094116 153.9 39.931227
30 Kentucky 2020-12-06 4.486116 17.0 2015.0 43.891407 44.179363 109.9 39.190210
31 Michigan 2020-12-06 4.269527 22.0 2015.0 54.756756 27.480343 174.8 39.934856
32 Delaware 2020-12-06 4.109640 8.0 2015.0 66.104446 60.680515 460.8 40.828502
33 New Jersey 2020-12-06 4.074186 30.0 2015.0 63.421669 53.512871 1195.5 39.771274
34 Ohio 2020-12-06 4.061887 35.0 2015.0 48.638023 26.878523 282.3 39.686289
35 North Carolina 2020-12-06 3.801582 33.0 2015.0 46.824627 27.300554 196.1 39.216283
36 Massachusetts 2020-12-06 3.724581 21.0 2015.0 78.250504 66.258080 839.4 40.035757
37 Connecticut 2020-12-04 3.558826 7.0 2015.0 94.820755 65.931198 738.1 40.588005
38 New York 2020-12-06 3.553564 32.0 2015.0 70.520905 66.608749 411.2 39.737657
39 Maryland 2020-12-06 3.536860 20.0 2015.0 60.716412 54.376816 594.8 39.152902
40 California 2020-12-06 3.435852 5.0 2015.0 59.882900 70.195778 239.1 38.033199
41 Pennsylvania 2020-12-06 3.299285 38.0 2015.0 52.347935 44.816399 283.9 40.722776
42 West Virginia 2020-12-06 3.050175 48.0 2015.0 41.004856 49.842640 77.1 41.694460
43 Virginia 2020-12-06 2.991592 46.0 2015.0 46.934807 44.350304 202.6 38.963615
44 Washington 2020-12-06 2.356361 47.0 2015.0 54.966827 61.661667 101.2 38.708887
45 Oregon 2020-12-06 2.012019 37.0 2015.0 58.208584 66.550529 39.9 40.085323
46 New Hampshire 2020-12-06 1.842771 29.0 2015.0 53.932976 47.906269 147.0 41.687916
47 Hawaii 2020-12-06 1.320954 11.0 2015.0 76.090012 69.053322 211.8 40.234645
48 Maine 2020-12-06 0.994945 19.0 2015.0 67.041496 49.225498 43.1 42.877923
49 Vermont 2020-12-06 0.803737 45.0 2015.0 86.200874 67.244621 67.9 41.891014
In [47]:
x = df_cpa['ConfirmedPercentage']
y = df_cpa['AverageAge']
ax = df_cpa.plot(x='ConfirmedPercentage',y='AverageAge',kind='scatter',figsize=(10,10))
# Add labels to each point
df_cpa[['ConfirmedPercentage','AverageAge','Province/State']].apply(lambda x: ax.text(*x),axis=1)

# Add a trendline
z = np.polyfit(x, y, 1)
p = np.poly1d(z)
plt.plot(x,p(x),"r--")

plt.xlabel('Confirmed Percentage')
plt.ylabel('Average Age')
plt.title('Average Age vs Confirmed Percentage')

# Display the R-squared value
from scipy import stats
slope, intercept, r, p_value, std_err = stats.linregress(x, y)
print("R Squared: " + str(r**2))
R Squared: 0.256351432924362

With an R-squared value of 0.26, we are getting a correlation closer to what we saw with political ideology. There appears to be a relationship between a younger average age and a higher percentage of covid cases. This would seem to make sense, since younger people are less severely affected by covid and may be more willing to expose themselves to more people. While we can't draw any conclusions about why this correlation exists, there does seem to be a modest relationship between average age and confirmed case percentage.

While I don't expect to see any correlation with this next set of data, let's run the analysis since we already have the data imported into our table. This next graph compares each state's "average" sex (a population-weighted mean of a male/female code, with male as 1 and female as 2) with its covid percentage rate.
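The weighting scheme described above can be sketched on toy data (the state names and population counts here are made up for illustration, not taken from the census table):

```python
import pandas as pd

# SEX is coded 1 = male, 2 = female, so a population-weighted mean of SEX
# lands between 1 and 2: closer to 2 means a higher share of women.
df = pd.DataFrame({
    'NAME': ['A', 'A', 'B', 'B'],
    'SEX':  [1, 2, 1, 2],
    'POP':  [40, 60, 55, 45],
})
df['weighted'] = df['SEX'] * df['POP']
grp = df.groupby('NAME')[['weighted', 'POP']].sum()
grp['AverageSex'] = grp['weighted'] / grp['POP']
print(grp['AverageSex'])  # A -> 1.60 (60% female), B -> 1.45 (45% female)
```

The real computation below follows the same shape, just with the census age/sex/origin table in place of this toy frame.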

In [48]:
df_sex = df_asr.copy()
df_sex = df_sex[df_sex.SEX > 0]
df_sex = df_sex[df_sex.ORIGIN != 0]
# df_sex = df_sex[['NAME','SEX','AGE','POPESTIMATE2019']]
df_sex = df_sex.groupby(['NAME', 'SEX'], as_index=False).sum()
df_sex['weighted'] = df_sex['SEX'] * df_sex['POPESTIMATE2019']
df_sex = df_sex.groupby(['NAME'], as_index=False).sum()
df_sex['AverageSex'] = df_sex['weighted'] / df_sex['POPESTIMATE2019']
df_sex = df_sex[['NAME','AverageSex']]
df_sex.head(100)
Out[48]:
NAME AverageSex
0 Alabama 1.516739
1 Alaska 1.478613
2 Arizona 1.503031
3 Arkansas 1.509042
4 California 1.502816
5 Colorado 1.496183
6 Connecticut 1.512124
7 Delaware 1.516522
8 District of Columbia 1.525736
9 Florida 1.511259
10 Georgia 1.513966
11 Hawaii 1.499927
12 Idaho 1.498660
13 Illinois 1.508630
14 Indiana 1.506796
15 Iowa 1.502047
16 Kansas 1.501703
17 Kentucky 1.507343
18 Louisiana 1.512336
19 Maine 1.510411
20 Maryland 1.515544
21 Massachusetts 1.514608
22 Michigan 1.507500
23 Minnesota 1.501774
24 Mississippi 1.515383
25 Missouri 1.509133
26 Montana 1.496560
27 Nebraska 1.500390
28 Nevada 1.498458
29 New Hampshire 1.504459
30 New Jersey 1.511316
31 New Mexico 1.505238
32 New York 1.514338
33 North Carolina 1.513709
34 North Dakota 1.488189
35 Ohio 1.509752
36 Oklahoma 1.504505
37 Oregon 1.504301
38 Pennsylvania 1.509892
39 Rhode Island 1.513221
40 South Carolina 1.515774
41 South Dakota 1.494995
42 Tennessee 1.512058
43 Texas 1.503285
44 Utah 1.496276
45 Vermont 1.505895
46 Virginia 1.507908
47 Washington 1.499404
48 West Virginia 1.504633
49 Wisconsin 1.502406
50 Wyoming 1.490755
In [49]:
df_cpas = df_cpa.copy()

df_sex = df_sex.rename(columns={'NAME': 'Province/State'})
df_cpas = df_cpas.merge(df_sex, on='Province/State', how='inner')
df_cpas.head(100)
Out[49]:
Province/State ObservationDate ConfirmedPercentage state year CitizenIdeology GovernmentIdeology 2010_DENSITY AverageAge AverageSex
0 North Dakota 2020-12-06 10.987397 34.0 2015.0 32.864830 31.388329 9.7 37.518283 1.488189
1 South Dakota 2020-12-06 9.796307 41.0 2015.0 39.302322 29.607647 10.7 38.462802 1.494995
2 Iowa 2020-12-06 7.742031 15.0 2015.0 47.608620 33.479546 54.5 39.130726 1.502047
3 Wisconsin 2020-12-06 7.581012 49.0 2015.0 50.566242 22.858555 105.0 39.849338 1.502406
4 Nebraska 2020-12-06 7.235729 27.0 2015.0 29.198908 27.135815 23.8 37.922156 1.500390
5 Utah 2020-12-06 6.818090 44.0 2015.0 25.136684 19.874485 33.6 33.714779 1.496276
6 Montana 2020-12-06 6.389258 26.0 2015.0 46.466301 43.793865 6.8 40.394879 1.496560
7 Wyoming 2020-12-06 6.312836 50.0 2015.0 26.684616 21.069750 5.8 39.021311 1.490755
8 Idaho 2020-12-06 6.300959 12.0 2015.0 24.808130 22.535210 19.0 37.841917 1.498660
9 Minnesota 2020-12-06 6.234042 23.0 2015.0 52.254734 59.618053 66.6 38.857877 1.501774
10 Illinois 2020-12-06 6.168180 13.0 2015.0 57.911098 57.168694 231.1 38.997553 1.508630
11 Tennessee 2020-12-06 5.906233 42.0 2015.0 38.731262 24.088978 153.9 39.214731 1.512058
12 Kansas 2020-12-06 5.871616 16.0 2015.0 36.105789 22.314838 34.9 38.193672 1.501703
13 Rhode Island 2020-12-05 5.852529 39.0 2015.0 80.016350 68.289284 1018.1 40.451412 1.513221
14 Indiana 2020-12-06 5.696092 14.0 2015.0 40.880875 24.779926 181.0 38.511627 1.506796
15 Arkansas 2020-12-06 5.659122 4.0 2015.0 36.244991 34.675446 56.0 39.016298 1.509042
16 Mississippi 2020-12-06 5.529430 24.0 2015.0 44.560547 31.340748 63.2 38.562118 1.515383
17 Alabama 2020-12-06 5.520143 1.0 2015.0 36.207596 25.924717 94.4 39.482018 1.516739
18 Nevada 2020-12-06 5.500480 28.0 2015.0 52.379105 40.495228 24.6 38.793240 1.498458
19 Oklahoma 2020-12-06 5.493842 36.0 2015.0 23.235449 23.841619 54.7 38.034852 1.504505
20 Missouri 2020-12-06 5.368742 25.0 2015.0 44.363598 42.468441 87.1 39.415176 1.509133
21 Louisiana 2020-12-06 5.363001 18.0 2015.0 40.553890 31.531614 104.9 38.371583 1.512336
22 New Mexico 2020-12-06 5.169923 31.0 2015.0 56.475410 42.315861 17.0 39.314857 1.505238
23 Arizona 2020-12-06 5.113436 3.0 2015.0 43.446938 19.375856 56.3 39.170100 1.503031
24 Alaska 2020-12-06 5.017966 2.0 2015.0 52.538380 39.996368 1.2 36.330879 1.478613
25 Florida 2020-12-06 4.964637 9.0 2015.0 48.224495 26.337852 350.6 41.874931 1.511259
26 Georgia 2020-12-06 4.754845 10.0 2015.0 43.246777 18.617783 168.4 37.656965 1.513966
27 Texas 2020-12-06 4.608148 43.0 2015.0 41.203247 20.944988 96.3 36.172459 1.503285
28 Colorado 2020-12-06 4.584301 6.0 2015.0 49.754890 57.053291 48.5 38.106171 1.496183
29 South Carolina 2020-12-06 4.560873 40.0 2015.0 41.808807 18.094116 153.9 39.931227 1.515774
30 Kentucky 2020-12-06 4.486116 17.0 2015.0 43.891407 44.179363 109.9 39.190210 1.507343
31 Michigan 2020-12-06 4.269527 22.0 2015.0 54.756756 27.480343 174.8 39.934856 1.507500
32 Delaware 2020-12-06 4.109640 8.0 2015.0 66.104446 60.680515 460.8 40.828502 1.516522
33 New Jersey 2020-12-06 4.074186 30.0 2015.0 63.421669 53.512871 1195.5 39.771274 1.511316
34 Ohio 2020-12-06 4.061887 35.0 2015.0 48.638023 26.878523 282.3 39.686289 1.509752
35 North Carolina 2020-12-06 3.801582 33.0 2015.0 46.824627 27.300554 196.1 39.216283 1.513709
36 Massachusetts 2020-12-06 3.724581 21.0 2015.0 78.250504 66.258080 839.4 40.035757 1.514608
37 Connecticut 2020-12-04 3.558826 7.0 2015.0 94.820755 65.931198 738.1 40.588005 1.512124
38 New York 2020-12-06 3.553564 32.0 2015.0 70.520905 66.608749 411.2 39.737657 1.514338
39 Maryland 2020-12-06 3.536860 20.0 2015.0 60.716412 54.376816 594.8 39.152902 1.515544
40 California 2020-12-06 3.435852 5.0 2015.0 59.882900 70.195778 239.1 38.033199 1.502816
41 Pennsylvania 2020-12-06 3.299285 38.0 2015.0 52.347935 44.816399 283.9 40.722776 1.509892
42 West Virginia 2020-12-06 3.050175 48.0 2015.0 41.004856 49.842640 77.1 41.694460 1.504633
43 Virginia 2020-12-06 2.991592 46.0 2015.0 46.934807 44.350304 202.6 38.963615 1.507908
44 Washington 2020-12-06 2.356361 47.0 2015.0 54.966827 61.661667 101.2 38.708887 1.499404
45 Oregon 2020-12-06 2.012019 37.0 2015.0 58.208584 66.550529 39.9 40.085323 1.504301
46 New Hampshire 2020-12-06 1.842771 29.0 2015.0 53.932976 47.906269 147.0 41.687916 1.504459
47 Hawaii 2020-12-06 1.320954 11.0 2015.0 76.090012 69.053322 211.8 40.234645 1.499927
48 Maine 2020-12-06 0.994945 19.0 2015.0 67.041496 49.225498 43.1 42.877923 1.510411
49 Vermont 2020-12-06 0.803737 45.0 2015.0 86.200874 67.244621 67.9 41.891014 1.505895
In [50]:
x = df_cpas['ConfirmedPercentage']
y = df_cpas['AverageSex']
ax = df_cpas.plot(x='ConfirmedPercentage',y='AverageSex',kind='scatter',figsize=(10,10))
# Add labels to each point
df_cpas[['ConfirmedPercentage','AverageSex','Province/State']].apply(lambda x: ax.text(*x),axis=1)

# Add a trendline
z = np.polyfit(x, y, 1)
p = np.poly1d(z)
plt.plot(x,p(x),"r--")

plt.xlabel('Confirmed Percentage')
plt.ylabel('Average Sex (Male-1 Female-2)')
plt.title('Sex vs Confirmed Percentage')

# Display the R-squared value
from scipy import stats
slope, intercept, r, p_value, std_err = stats.linregress(x, y)
print("R Squared: " + str(r**2))
R Squared: 0.13591456000979601

Unsurprisingly, there doesn't seem to be a significant correlation between sex and covid rates; however, our R-squared value of 0.14 is slightly higher than the one we got for density. If any correlation exists, it would imply that states with a higher ratio of men have higher percentages of covid cases.

Now let's take a look at average temperature and covid rates per state. I expect to see some correlation here, since warmer or colder weather affects how much time people spend indoors or outdoors.

In [51]:
# Data from: https://corgis-edu.github.io/corgis/csv/weather/
df_weather = pd.read_csv('Data/weather.csv')

# Fix column labels and remove unnecessary data
df_weather['Station.State'] = df_weather['Station.State'].replace(['DE'],'Delaware')
df_weather['Station.State'] = df_weather['Station.State'].replace(['VA'],'Virginia')
df_weather = df_weather[df_weather['Station.State'] != 'Puerto Rico']

df_weather = df_weather.groupby(['Station.State'], as_index=False).mean()
df_weather = df_weather[['Station.State','Data.Temperature.Avg Temp']]

df_weather.head(100)
Out[51]:
Station.State Data.Temperature.Avg Temp
0 Alabama 66.528302
1 Alaska 41.157650
2 Arizona 66.286792
3 Arkansas 62.308176
4 California 62.363363
5 Colorado 49.273585
6 Connecticut 53.396226
7 Delaware 56.056604
8 Florida 73.926101
9 Georgia 66.220126
10 Hawaii 76.384906
11 Idaho 51.698113
12 Illinois 53.377358
13 Indiana 53.976415
14 Iowa 50.796226
15 Kansas 56.867925
16 Kentucky 58.286792
17 Louisiana 69.475096
18 Maine 45.415094
19 Maryland 56.971698
20 Massachusetts 53.261438
21 Michigan 48.002096
22 Minnesota 44.675472
23 Mississippi 66.474394
24 Missouri 57.423181
25 Montana 45.627787
26 Nebraska 51.634434
27 Nevada 55.628931
28 New Hampshire 51.110000
29 New Jersey 56.066038
30 New Mexico 58.622642
31 New York 51.040094
32 North Carolina 61.770889
33 North Dakota 45.166038
34 Ohio 52.665768
35 Oklahoma 61.716981
36 Oregon 52.870283
37 Pennsylvania 53.919137
38 Rhode Island 53.037736
39 South Carolina 65.188679
40 South Dakota 48.415094
41 Tennessee 60.802381
42 Texas 68.181604
43 Utah 55.698113
44 Vermont 48.773585
45 Virginia 58.876011
46 Washington 51.366577
47 West Virginia 55.023585
48 Wisconsin 49.165094
49 Wyoming 46.490566
In [52]:
df_cpast = df_cpas.copy()

df_weather = df_weather.rename(columns={'Station.State': 'Province/State'})
df_weather = df_weather.rename(columns={'Data.Temperature.Avg Temp': 'AverageTemp'})
df_cpast = df_cpast.merge(df_weather, on='Province/State', how='inner')
df_cpast.head(100)
Out[52]:
Province/State ObservationDate ConfirmedPercentage state year CitizenIdeology GovernmentIdeology 2010_DENSITY AverageAge AverageSex AverageTemp
0 North Dakota 2020-12-06 10.987397 34.0 2015.0 32.864830 31.388329 9.7 37.518283 1.488189 45.166038
1 South Dakota 2020-12-06 9.796307 41.0 2015.0 39.302322 29.607647 10.7 38.462802 1.494995 48.415094
2 Iowa 2020-12-06 7.742031 15.0 2015.0 47.608620 33.479546 54.5 39.130726 1.502047 50.796226
3 Wisconsin 2020-12-06 7.581012 49.0 2015.0 50.566242 22.858555 105.0 39.849338 1.502406 49.165094
4 Nebraska 2020-12-06 7.235729 27.0 2015.0 29.198908 27.135815 23.8 37.922156 1.500390 51.634434
5 Utah 2020-12-06 6.818090 44.0 2015.0 25.136684 19.874485 33.6 33.714779 1.496276 55.698113
6 Montana 2020-12-06 6.389258 26.0 2015.0 46.466301 43.793865 6.8 40.394879 1.496560 45.627787
7 Wyoming 2020-12-06 6.312836 50.0 2015.0 26.684616 21.069750 5.8 39.021311 1.490755 46.490566
8 Idaho 2020-12-06 6.300959 12.0 2015.0 24.808130 22.535210 19.0 37.841917 1.498660 51.698113
9 Minnesota 2020-12-06 6.234042 23.0 2015.0 52.254734 59.618053 66.6 38.857877 1.501774 44.675472
10 Illinois 2020-12-06 6.168180 13.0 2015.0 57.911098 57.168694 231.1 38.997553 1.508630 53.377358
11 Tennessee 2020-12-06 5.906233 42.0 2015.0 38.731262 24.088978 153.9 39.214731 1.512058 60.802381
12 Kansas 2020-12-06 5.871616 16.0 2015.0 36.105789 22.314838 34.9 38.193672 1.501703 56.867925
13 Rhode Island 2020-12-05 5.852529 39.0 2015.0 80.016350 68.289284 1018.1 40.451412 1.513221 53.037736
14 Indiana 2020-12-06 5.696092 14.0 2015.0 40.880875 24.779926 181.0 38.511627 1.506796 53.976415
15 Arkansas 2020-12-06 5.659122 4.0 2015.0 36.244991 34.675446 56.0 39.016298 1.509042 62.308176
16 Mississippi 2020-12-06 5.529430 24.0 2015.0 44.560547 31.340748 63.2 38.562118 1.515383 66.474394
17 Alabama 2020-12-06 5.520143 1.0 2015.0 36.207596 25.924717 94.4 39.482018 1.516739 66.528302
18 Nevada 2020-12-06 5.500480 28.0 2015.0 52.379105 40.495228 24.6 38.793240 1.498458 55.628931
19 Oklahoma 2020-12-06 5.493842 36.0 2015.0 23.235449 23.841619 54.7 38.034852 1.504505 61.716981
20 Missouri 2020-12-06 5.368742 25.0 2015.0 44.363598 42.468441 87.1 39.415176 1.509133 57.423181
21 Louisiana 2020-12-06 5.363001 18.0 2015.0 40.553890 31.531614 104.9 38.371583 1.512336 69.475096
22 New Mexico 2020-12-06 5.169923 31.0 2015.0 56.475410 42.315861 17.0 39.314857 1.505238 58.622642
23 Arizona 2020-12-06 5.113436 3.0 2015.0 43.446938 19.375856 56.3 39.170100 1.503031 66.286792
24 Alaska 2020-12-06 5.017966 2.0 2015.0 52.538380 39.996368 1.2 36.330879 1.478613 41.157650
25 Florida 2020-12-06 4.964637 9.0 2015.0 48.224495 26.337852 350.6 41.874931 1.511259 73.926101
26 Georgia 2020-12-06 4.754845 10.0 2015.0 43.246777 18.617783 168.4 37.656965 1.513966 66.220126
27 Texas 2020-12-06 4.608148 43.0 2015.0 41.203247 20.944988 96.3 36.172459 1.503285 68.181604
28 Colorado 2020-12-06 4.584301 6.0 2015.0 49.754890 57.053291 48.5 38.106171 1.496183 49.273585
29 South Carolina 2020-12-06 4.560873 40.0 2015.0 41.808807 18.094116 153.9 39.931227 1.515774 65.188679
30 Kentucky 2020-12-06 4.486116 17.0 2015.0 43.891407 44.179363 109.9 39.190210 1.507343 58.286792
31 Michigan 2020-12-06 4.269527 22.0 2015.0 54.756756 27.480343 174.8 39.934856 1.507500 48.002096
32 Delaware 2020-12-06 4.109640 8.0 2015.0 66.104446 60.680515 460.8 40.828502 1.516522 56.056604
33 New Jersey 2020-12-06 4.074186 30.0 2015.0 63.421669 53.512871 1195.5 39.771274 1.511316 56.066038
34 Ohio 2020-12-06 4.061887 35.0 2015.0 48.638023 26.878523 282.3 39.686289 1.509752 52.665768
35 North Carolina 2020-12-06 3.801582 33.0 2015.0 46.824627 27.300554 196.1 39.216283 1.513709 61.770889
36 Massachusetts 2020-12-06 3.724581 21.0 2015.0 78.250504 66.258080 839.4 40.035757 1.514608 53.261438
37 Connecticut 2020-12-04 3.558826 7.0 2015.0 94.820755 65.931198 738.1 40.588005 1.512124 53.396226
38 New York 2020-12-06 3.553564 32.0 2015.0 70.520905 66.608749 411.2 39.737657 1.514338 51.040094
39 Maryland 2020-12-06 3.536860 20.0 2015.0 60.716412 54.376816 594.8 39.152902 1.515544 56.971698
40 California 2020-12-06 3.435852 5.0 2015.0 59.882900 70.195778 239.1 38.033199 1.502816 62.363363
41 Pennsylvania 2020-12-06 3.299285 38.0 2015.0 52.347935 44.816399 283.9 40.722776 1.509892 53.919137
42 West Virginia 2020-12-06 3.050175 48.0 2015.0 41.004856 49.842640 77.1 41.694460 1.504633 55.023585
43 Virginia 2020-12-06 2.991592 46.0 2015.0 46.934807 44.350304 202.6 38.963615 1.507908 58.876011
44 Washington 2020-12-06 2.356361 47.0 2015.0 54.966827 61.661667 101.2 38.708887 1.499404 51.366577
45 Oregon 2020-12-06 2.012019 37.0 2015.0 58.208584 66.550529 39.9 40.085323 1.504301 52.870283
46 New Hampshire 2020-12-06 1.842771 29.0 2015.0 53.932976 47.906269 147.0 41.687916 1.504459 51.110000
47 Hawaii 2020-12-06 1.320954 11.0 2015.0 76.090012 69.053322 211.8 40.234645 1.499927 76.384906
48 Maine 2020-12-06 0.994945 19.0 2015.0 67.041496 49.225498 43.1 42.877923 1.510411 45.415094
49 Vermont 2020-12-06 0.803737 45.0 2015.0 86.200874 67.244621 67.9 41.891014 1.505895 48.773585
In [53]:
x = df_cpast['ConfirmedPercentage']
y = df_cpast['AverageTemp']
ax = df_cpast.plot(x='ConfirmedPercentage',y='AverageTemp',kind='scatter',figsize=(10,10))
# Add labels to each point
df_cpast[['ConfirmedPercentage','AverageTemp','Province/State']].apply(lambda x: ax.text(*x),axis=1)

# Add a trendline
z = np.polyfit(x, y, 1)
p = np.poly1d(z)
plt.plot(x,p(x),"r--")

plt.xlabel('Confirmed Percentage')
plt.ylabel('Average Temperature (F)')
plt.title('Temperature vs Confirmed Percentage')

# Display the R-squared value
slope, intercept, r, p_value, std_err = stats.linregress(x, y)
print("R Squared: " + str(r**2))
R Squared: 0.032898004980701896

With an R-squared of 0.03, average temperature has the weakest correlation of all the data we have looked at so far. This surprised me the most, since there seems to be no correlation at all. As I explained before running this analysis, I had thought that temperature's effect on how much time people spend indoors or outdoors would affect covid rates, but that does not appear to be the case.
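For reference, the R-squared values quoted throughout this section are just the squared Pearson correlation of each simple linear fit, equivalently 1 - SS_res / SS_tot. A minimal sketch on toy numbers (not the covid data):

```python
import numpy as np

# Toy data, not the covid data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

slope, intercept = np.polyfit(x, y, 1)
pred = slope * x + intercept
ss_res = np.sum((y - pred) ** 2)      # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot

# For a simple linear fit this matches the squared correlation coefficient,
# which is what linregress's r**2 gives us above
print(round(r2, 4), round(np.corrcoef(x, y)[0, 1] ** 2, 4))
```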

Finally, let's actually build the machine learning model with our variables. We will use statsmodels to run our multiple linear regression, fitting the model with some of the data we looked at above.

Let's start by running the model with just one variable to see how well it fits our data.

In [54]:
X = df_cpa['CitizenIdeology']
y = df_cpa['ConfirmedPercentage']
X = sm.add_constant(X)

model = sm.OLS(y, X).fit()

predictions = model.predict(X)
model.summary()
Out[54]:
OLS Regression Results
Dep. Variable: ConfirmedPercentage R-squared: 0.324
Model: OLS Adj. R-squared: 0.310
Method: Least Squares F-statistic: 23.06
Date: Mon, 21 Dec 2020 Prob (F-statistic): 1.58e-05
Time: 15:07:15 Log-Likelihood: -94.881
No. Observations: 50 AIC: 193.8
Df Residuals: 48 BIC: 197.6
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 8.4885 0.789 10.756 0.000 6.902 10.075
CitizenIdeology -0.0722 0.015 -4.802 0.000 -0.102 -0.042
Omnibus: 8.465 Durbin-Watson: 0.498
Prob(Omnibus): 0.015 Jarque-Bera (JB): 7.607
Skew: 0.830 Prob(JB): 0.0223
Kurtosis: 3.946 Cond. No. 178.


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Now let's run it again, this time using multiple variables to see if we can improve the fit. Our last R-squared was 0.32.

In [55]:
import statsmodels.formula.api as smf

x = df_cpast[['CitizenIdeology','AverageAge','AverageTemp']]
y = df_cpast['ConfirmedPercentage']
model = smf.ols(formula='ConfirmedPercentage ~ CitizenIdeology + AverageAge + AverageTemp', data=df_cpast).fit()

plt.rcParams["figure.figsize"] = (12,10)

model.summary()
Out[55]:
OLS Regression Results
Dep. Variable: ConfirmedPercentage R-squared: 0.437
Model: OLS Adj. R-squared: 0.400
Method: Least Squares F-statistic: 11.89
Date: Mon, 21 Dec 2020 Prob (F-statistic): 6.89e-06
Time: 15:07:15 Log-Likelihood: -90.338
No. Observations: 50 AIC: 188.7
Df Residuals: 46 BIC: 196.3
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 24.7420 6.368 3.885 0.000 11.923 37.561
CitizenIdeology -0.0567 0.017 -3.367 0.002 -0.091 -0.023
AverageAge -0.3466 0.168 -2.061 0.045 -0.685 -0.008
AverageTemp -0.0613 0.028 -2.178 0.035 -0.118 -0.005
Omnibus: 3.623 Durbin-Watson: 0.815
Prob(Omnibus): 0.163 Jarque-Bera (JB): 2.861
Skew: 0.579 Prob(JB): 0.239
Kurtosis: 3.174 Cond. No. 2.50e+03


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.5e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
In [56]:
import statsmodels.formula.api as smf

# Rename the density column so it is a valid name in the model formula
df_cpast = df_cpast.rename(columns={'2010_DENSITY': 'Density'})

x = df_cpast[['CitizenIdeology','AverageAge','AverageTemp','Density','AverageSex']]
y = df_cpast['ConfirmedPercentage']
model = smf.ols(formula='ConfirmedPercentage ~ CitizenIdeology + AverageAge + AverageTemp + AverageSex + Density', data=df_cpast).fit()

plt.rcParams["figure.figsize"] = (12,10)

model.summary()
Out[56]:
OLS Regression Results
Dep. Variable: ConfirmedPercentage R-squared: 0.473
Model: OLS Adj. R-squared: 0.413
Method: Least Squares F-statistic: 7.885
Date: Mon, 21 Dec 2020 Prob (F-statistic): 2.27e-05
Time: 15:07:15 Log-Likelihood: -88.694
No. Observations: 50 AIC: 189.4
Df Residuals: 44 BIC: 200.9
Df Model: 5
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 61.7979 59.074 1.046 0.301 -57.258 180.854
CitizenIdeology -0.0770 0.020 -3.757 0.001 -0.118 -0.036
AverageAge -0.2724 0.188 -1.448 0.155 -0.651 0.107
AverageTemp -0.0567 0.034 -1.651 0.106 -0.126 0.012
AverageSex -26.3102 41.993 -0.627 0.534 -110.942 58.321
Density 0.0020 0.001 1.722 0.092 -0.000 0.004
Omnibus: 3.044 Durbin-Watson: 0.817
Prob(Omnibus): 0.218 Jarque-Bera (JB): 2.632
Skew: 0.561 Prob(JB): 0.268
Kurtosis: 2.922 Cond. No. 1.11e+05


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.11e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
In [58]:
pred = model.predict(x)

plt.scatter(y,pred)

plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Actual vs Predicted')
Out[58]:
Text(0.5, 1.0, 'Actual vs Predicted')
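One small readability improvement for an actual-vs-predicted scatter is to overlay a y = x reference line, since points on that line are perfect predictions. A sketch with made-up numbers standing in for the model's y and pred:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt

# Made-up stand-ins for the model's actual and predicted values
actual = np.array([3.2, 5.5, 7.1, 4.8, 6.0])
predicted = np.array([3.5, 5.1, 6.8, 5.0, 6.3])

fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(actual, predicted)

# y = x line spanning the data range: points on it are perfect predictions
lims = [min(actual.min(), predicted.min()), max(actual.max(), predicted.max())]
ax.plot(lims, lims, 'r--', label='perfect prediction')

ax.set_xlabel('Actual')
ax.set_ylabel('Predicted')
ax.set_title('Actual vs Predicted')
ax.legend()
```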

Conclusion

The COVID-19 pandemic has had a massive impact on the world and has significantly changed the way we all live our lives. As unfortunate an event as this is, it is encouraging to see such a wealth of publicly available data about the pandemic on a global scale.

I found the two most interesting parts of this tutorial to be the choropleth map and the machine learning process. Being able to watch the progression of the virus using the slider on the map was fascinating, and tracking down other datasets to combine with the covid data for the machine learning model was rewarding as well. I was especially intrigued by the correlation we saw between political ideology and the percentage of COVID-19 cases in each state.

Utilizing the libraries and tools freely available in Python, we can manipulate and analyze this data to learn some genuinely interesting things about how the virus has affected different countries and the states/provinces within them. It will be interesting to see how the results of this analysis change over time as more data becomes available. Additionally, while this tutorial covers a lot of the analysis you can do with this dataset, there are certainly many ways to expand on it and learn even more about COVID-19's effects on the world around us. Combining this data with other publicly available datasets can yield an even more thorough and detailed understanding of the pandemic and its effects.